# Data Cleaning

This notebook follows along with the tutorial Pythonic Data Cleaning with Pandas and NumPy, available here: 
https://realpython.com/python-data-cleaning-numpy-pandas/

I've included all of the datasets in the GitHub Repo where you will also find this activity. 

We’ll cover the following:

    - Dropping unnecessary columns in a DataFrame
    - Changing the index of a DataFrame
    - Using .str methods to clean columns
    - Using the DataFrame.map() function (called applymap() in the tutorial) to clean the entire dataset, element-wise
    - Renaming columns to a more recognizable set of labels
    - Skipping unnecessary rows in a CSV file


In [79]:
import numpy as np
import pandas as pd

In [80]:
df = pd.read_csv('BL-Flickr-Images-Book.csv')

In [81]:
df.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


## Dropping Columns in a DataFrame

When we look at the first five entries using the [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method, we can see that a handful of columns provide information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks. 

We can drop these columns from the dataframe using [drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html).

In [82]:
>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

We can verify that our drops worked by printing the head again. If we want to see the last entries instead, we can use [tail()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html). 

In [83]:
df.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


In [84]:
df.tail()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
8282,4158088,London,1838,,"The Parochial History of Cornwall, founded on,...","GIDDY, afterwards GILBERT, Davies.",http://www.flickr.com/photos/britishlibrary/ta...
8283,4158128,Derby,"1831, 32",M. Mozley & Son,The History and Gazetteer of the County of Der...,"GLOVER, Stephen - of Derby",http://www.flickr.com/photos/britishlibrary/ta...
8284,4159563,London,[1806]-22,T. Cadell and W. Davies,Magna Britannia; being a concise topographical...,"LYSONS, Daniel - M.A., F.R.S., and LYSONS (Sam...",http://www.flickr.com/photos/britishlibrary/ta...
8285,4159587,Newcastle upon Tyne,1834,Mackenzie & Dent,"An historical, topographical and descriptive v...","Mackenzie, E. (Eneas)",http://www.flickr.com/photos/britishlibrary/ta...
8286,4160339,London,1834-43,,Collectanea Topographica et Genealogica. [Firs...,,http://www.flickr.com/photos/britishlibrary/ta...


Here we passed axis=1, which tells drop() that the labels in to_drop refer to columns rather than rows. <br>
``` df.drop(to_drop, inplace=True, axis=1)```

Alternatively, we could pass our list to the columns parameter of drop() directly. 
This is perhaps even more readable. <br>
``` df.drop(columns=to_drop, inplace=True)```
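
To check that the two spellings are interchangeable, here is a minimal sketch on a tiny hand-made frame. The `books` frame and its values are made up for illustration and are not taken from the dataset. It uses the non-in-place form, which returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical three-column frame standing in for the real dataset
books = pd.DataFrame({
    'Identifier': [206, 216],
    'Engraver': [None, None],
    'Shelfmarks': ['A', 'B'],
})

to_drop = ['Engraver', 'Shelfmarks']

by_axis = books.drop(to_drop, axis=1)    # labels plus axis=1
by_name = books.drop(columns=to_drop)    # keyword form

print(by_axis.equals(by_name))   # True: both calls give the same frame
print(list(books.columns))       # the original still has all three columns
```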

## Changing the Index

An index allows you to uniquely identify a row in a dataframe. Think of it as an ID or identifier for that row. An index should be something unique, so that when you use it to identify a row, you don't get conflicts. Our dataframe above has an Identifier column. To see whether it is a good candidate for the index, we can check that all of its values are unique using the [is_unique](https://pandas.pydata.org/docs/reference/api/pandas.Series.is_unique.html) property (it is an attribute, so it takes no parentheses). 

In [85]:
>>> df['Identifier'].is_unique

True

Since it is unique, we can assign the Identifier column to be the index using [set_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html).

In [86]:
df = df.set_index('Identifier')
df.head()

Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


Notice that the leftmost column, which indicated the row number, is gone. 

We can locate a specific record in the dataframe now using [loc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html). Let's say we wanted to get the book with identifier 218. 

In [87]:
df.loc[218]

Place of Publication                                               London
Date of Publication                                                  1869
Publisher                                           Bradbury, Evans & Co.
Title                   Love the Avenger. By the author of “All for Gr...
Author                                                          A., A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 218, dtype: object

We can still find records by position using [iloc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html). Python is generally 0-indexed, so the first record is at position 0, the second at position 1, and so on. We can therefore find the same record by accessing position 2 (the third entry in the frame).

In [88]:
df.iloc[2]

Place of Publication                                               London
Date of Publication                                                  1869
Publisher                                           Bradbury, Evans & Co.
Title                   Love the Avenger. By the author of “All for Gr...
Author                                                          A., A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 218, dtype: object
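
The difference between label-based and position-based lookup is easier to see on a small frame. The sketch below reuses three of the identifiers shown above; the frame name `df_small` and the shortened `Title` values are placeholders for illustration.

```python
import pandas as pd

df_small = pd.DataFrame(
    {'Title': ['Walter Forbes', 'All for Greed', 'Love the Avenger']},
    index=pd.Index([206, 216, 218], name='Identifier'),
)

by_label = df_small.loc[218]      # row whose index label is 218
by_position = df_small.iloc[2]    # third row, counting from 0

print(by_label.equals(by_position))   # True: both point at the same row
```

Note that `df_small.loc[2]` would raise a `KeyError` here, because no row carries the label 2.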

Notice that we re-assigned the df variable: ```df = df.set_index('Identifier')```. This is because, by default, a method like .set_index() returns a modified copy of the dataframe and leaves the original untouched. If we want to bypass that, we can run the method in place, telling pandas to set the index on the original dataframe, so no re-assignment is needed: ```df.set_index('Identifier', inplace=True)```. There will be times when you want to manipulate the dataframe directly and other times when you don't (for example, when you just want to peek at what an operation will do, or you are still at an intermediate stage of cleaning). You'll want to use [inplace=True carefully](https://www.askpython.com/python-modules/pandas/inplace-true-parameter). 
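
A quick way to convince yourself of the copy-versus-in-place behavior is to try both on a throwaway frame. This is only a sketch; the `raw` frame and its values are made up.

```python
import pandas as pd

raw = pd.DataFrame({'Identifier': [206, 216], 'Title': ['A', 'B']})

indexed = raw.set_index('Identifier')   # returns a new frame
print(list(raw.columns))                # raw still has 'Identifier' as a column
print(indexed.index.name)               # the new frame is indexed by it

result = raw.set_index('Identifier', inplace=True)   # modifies raw itself
print(result)                           # None: in-place calls return nothing
print(raw.index.name)                   # raw is now indexed by 'Identifier'
```

Because the in-place call returns `None`, writing `df = df.set_index(..., inplace=True)` would overwrite the dataframe with `None`, which is a common slip.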

## Tidying up Fields in the Data
So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the <strong> object </strong> dtype, which in our case is roughly analogous to <strong> str </strong> in native Python. 
One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road. 

In [89]:
df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

    - Remove the extra dates in square brackets, wherever present: 1879 [1878]
    - Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
    - Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]
    - Convert the string nan to NumPy’s NaN value


We can use something called a [regular expression](https://www.w3schools.com/python/python_regex.asp) to extract the publication date for us. 

In [90]:
regex = r'^(\d{4})'

The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. 
The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. 
The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
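
To see what the pattern captures, we can try it on the sample strings from the list above. This sketch uses Python's built-in `re` module rather than pandas, purely to illustrate the pattern; the `years` dictionary is just a helper for this example.

```python
import re

regex = r'^(\d{4})'

# Sample values taken from the cleaning rules listed earlier
samples = ['1879 [1878]', '1860-63', '1839, 38-54', '[1897?]']

years = {}
for text in samples:
    match = re.search(regex, text)
    # group(1) is the captured four-digit year; no match gives None
    years[text] = match.group(1) if match else None

print(years)
```

The first three strings yield their leading year, while `[1897?]` yields `None` because it starts with a bracket rather than four digits.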

Let's see how this extraction pays off. We will use [str.extract()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html) which allows us to extract strings that match our regex pattern in the Date of Publication column. 

In [91]:
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

In [92]:
extr.head()

Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Notice that the dtype of the extracted values is still object, meaning they are still basically strings. 
We can convert them to a numeric type (float, in this case) using [to_numeric()](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html) and store the result back in the column.

In [93]:
df['Date of Publication'] = pd.to_numeric(extr)

In [94]:
df['Date of Publication'].dtype

dtype('float64')
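
The same two steps can be tried on a handful of values to see where the floats come from. The sketch below assumes a small made-up Series (`dates`) with object dtype, mirroring the column in the notebook.

```python
import pandas as pd

# Hypothetical sample; object dtype mirrors the real column
dates = pd.Series(['1879 [1878]', '1860-63', '[1897?]', None],
                  dtype='object', name='Date of Publication')

years = dates.str.extract(r'^(\d{4})', expand=False)
numeric = pd.to_numeric(years)

print(numeric.dtype)            # float64, since NaN is a float value
print(numeric.isnull().sum())   # 2: the bracketed date and the missing one
print(numeric.isnull().mean())  # 0.5: same as isnull().sum() / len(numeric)
```

Values that do not start with four digits come out of `str.extract()` as NaN, and the presence of NaN is what pushes the column to a float dtype rather than an integer one.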

After converting to floats, some rows inevitably end up without a date, because their original values
don't conform to the pattern (four leading digits) we outlined in the regex; these are stored as NaN. 
If we cared, we could look into why this is the case. We can calculate the fraction of rows in the dataframe that have a null value for the date of publication like so:

In [95]:
df['Date of Publication'].isnull().sum() / len(df)

0.11717147339205986

So a little over 1 in 10 rows have no date of publication. A large part of the role of any data scientist or analyst
is to clean the data, so I challenge you to dig into this one yourself.

#### Challenge 1: Find out what the reason is for the null values for date of publication. The answer may be simple.

## Combining str Methods with NumPy to Clean Columns

The str.extract() method that we used above is a [string operation](https://pandas.pydata.org/pandas-docs/stable/text.html),
which is a quick way to act on str-like objects in our dataframes. 
There are other str operations, such as 
[replace()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html), [split()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html), and [capitalize()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.capitalize.html).
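
As a rough illustration of these three operations, here they are applied to a made-up two-element Series (the `places` values are lowercase variants invented for this example):

```python
import pandas as pd

places = pd.Series(['newcastle-upon-tyne', 'london; virtue & yorston'])

replaced = places.str.replace('-', ' ', regex=False)  # literal text replacement
parts = places.str.split(';')                         # each value becomes a list
capped = places.str.capitalize()                      # first letter upper, rest lower

print(replaced.tolist())
print(parts.tolist())
print(capped.tolist())
```

Each call works on every element of the Series at once and returns a new Series, so the results can be chained or assigned back to a column.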

To clean the Place of Publication field, we can combine pandas str methods with NumPy’s [np.where](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function. <br> Its general syntax is:
```np.where(condition, then, else)```

Here, condition is either an array-like object or a Boolean mask. then is the value to be used if condition evaluates to True, and else is the value to be used otherwise. 

Essentially, .where() takes each element in the object used for condition, checks whether that particular element evaluates to True in the context of the condition, and returns an ndarray containing then or else, depending on which applies. I will also note that where() can be nested. 
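
Here is a small sketch of both the plain and the nested form on a toy array. The `years` values and the labels are invented for the example.

```python
import numpy as np

years = np.array([1750, 1850, 1950])

# Single condition: one value where True, another where False
labels = np.where(years < 1800, 'old', 'new')

# Nested: the "else" branch is itself another np.where
eras = np.where(years < 1800, '18th c. or earlier',
                np.where(years < 1900, '19th c.', '20th c. or later'))

print(labels.tolist())   # ['old', 'new', 'new']
print(eras.tolist())     # ['18th c. or earlier', '19th c.', '20th c. or later']
```

The nested call reads like an if / elif / else chain evaluated for every element at once.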

In [96]:
df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. 
If we were to look at more values, we would see that this is the case only for some of the rows 
whose place of publication is <strong> ‘London’ </strong> or <strong> ‘Oxford’ </strong>.

In [97]:
df.loc[4157862]

Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                1867.0
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                      FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

In [98]:
df.loc[4159587]

Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                1834.0
Publisher                                                Mackenzie & Dent
Title                   An historical, topographical and descriptive v...
Author                                              Mackenzie, E. (Eneas)
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

You can see that these two entries have the same place of publication but in different formats. The first has hyphens where the second has spaces. This is likely due to unconstrained data entry: no particular format is required when a new book is added to the database. This is very common when dealing with data in the real world. You'll get stuff like this all the time. Another example: <br>
Chicago, IL <br>
Chicago <br>
Chicago, Illinois <br>
Addresses are particularly notorious for this sort of inconsistency. 
In fact, there are whole essays and tools ([Ex 1](https://developers.google.com/maps/documentation/address-validation), [Ex 2](https://www.geocod.io/features/api/)) devoted specifically to address validation for data analysis. 
Cleaning fields like this is the [grunt work](https://xkcd.com/1831/) of data analysts. We all hate it, but we must do it. 

To clean this column in one sweep, we can use str.contains() to get a Boolean mask.
We will build one mask for the rows that contain the word London and another for the rows that contain Oxford. 

In [110]:
pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool

In [111]:
oxford = pub.str.contains('Oxford')

In [112]:
df['Place of Publication'] = np.where(london, 'London',
                                 np.where(oxford, 'Oxford',
                                    pub.str.replace('-', ' ')))

Let's see that Newcastle line again after applying the cleaning. You'll notice the dashes are gone. 

In [113]:
df.loc[4157862]

Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                1867.0
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                      FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

Let's break down what went on there. 

    london = pub.str.contains('London'): This line creates a boolean Series (london) that is True for entries in the 'Place of Publication' column containing the substring 'London' and False otherwise.

    oxford = pub.str.contains('Oxford'): Similar to the previous line, this creates a boolean Series (oxford) for entries containing the substring 'Oxford'.

    pub.str.replace('-', ' '): This part replaces all occurrences of hyphens ('-') in the 'Place of Publication' column with spaces. This operation is applied to the entire column.

    np.where(london, 'London', ... ): This is a numpy function that takes three arguments:
        The first argument (london) is a boolean condition (does our row have the word London in it?)
        The second argument is the value to be assigned where the condition is True. In this case, it's the string 'London'.
        The third argument is what happens where the condition is False. In this case, it's another np.where statement.

    np.where(oxford, 'Oxford', pub.str.replace('-', ' ')): This is another np.where statement nested within the previous one. It follows the same logic:
        If the 'oxford' condition is True (the row has the word Oxford in it) it assigns 'Oxford'.
        If the 'oxford' condition is False, it goes to the next part, which is the replacement of hyphens with spaces.

So, in summary, the code first checks if the entry contains 'London' and assigns 'London' if true. If false, it then checks if the entry contains 'Oxford' and assigns 'Oxford' if true. If neither condition is met, it replaces hyphens with spaces in the 'Place of Publication' column. This helps in cleaning up and standardizing the values in the column based on the specified conditions.

We could of course use a different approach: for example, clean up the London rows by themselves, then clean up the Oxford rows by themselves, and finally replace all dashes with spaces in the remaining entries. This particular technique just wraps it all up nicely in one swoop. This is the power of <strong> NumPy </strong>.
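
For comparison, the step-by-step alternative might look roughly like the sketch below. It is not the approach used in this notebook, and the three `places` values are a small sample assembled for illustration. The masks are applied in an order that lets London take priority, to mirror the nested np.where.

```python
import pandas as pd

places = pd.Series(['London; Virtue & Yorston',
                    'pp. 40. G. Bryan & Co: Oxford, 1898',
                    'Newcastle-upon-Tyne'])

london = places.str.contains('London')
oxford = places.str.contains('Oxford')

# Start from the hyphen-free version, then overwrite the matching rows
cleaned = places.str.replace('-', ' ', regex=False)
cleaned[oxford] = 'Oxford'
cleaned[london] = 'London'   # applied last so that London wins if both match

print(cleaned.tolist())      # ['London', 'Oxford', 'Newcastle upon Tyne']
```

This gives the same result in three assignments instead of one expression, which some readers may find easier to follow.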

## Cleaning the Entire Dataset Using the map() Function

Let's imagine that the dirty data isn't particularly localized to certain columns, like they were in our last example. 
Let's say I want to enforce certain standards across the entire dataset. We can do this using [map()](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.map.html). 
Note that the tutorial uses applymap(). That method was deprecated in pandas 2.1 in favor of map(), which is what we use in this notebook.

We will create a dataframe from the <b> university_towns.txt </b> file. 
If we take a peek at the file, we will notice that it generally follows this format: <br>
State[edit] <br>
Region(University) <br>
Region(University) <br>
Region(University) <br>

$ head university_towns.txt  <br>
Alabama[edit]  <br>
Auburn (Auburn University)[1] <br>
Florence (University of North Alabama) <br>
Jacksonville (Jacksonville State University)[2] <br>
Livingston (University of West Alabama)[2] <br> 
Montevallo (University of Montevallo)[2] <br>
Troy (Troy University)[2] <br>
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4] <br>
Tuskegee (Tuskegee University)[5] <br>
Alaska[edit] <br>


We will use this formatting to our advantage to build a dataframe. 

In [120]:
# Instantiate our university towns list
university_towns = []

In [122]:
# Open the file, read each line, if it has edit in it, it is a state, otherwise it is a university town.
with open('university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

In [123]:
# Take a peek at our constructed list
university_towns[:5]

[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
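
Notice the trailing `\n` on every entry. One possible variation, sketched below, strips the newline while reading. It uses `io.StringIO` as a stand-in for the real file so that it runs on its own; the Alaska town line is a made-up example, not copied from the file.

```python
import io

# A few lines in the file's format; StringIO stands in for open(...)
sample = io.StringIO(
    'Alabama[edit]\n'
    'Auburn (Auburn University)[1]\n'
    'Florence (University of North Alabama)\n'
    'Alaska[edit]\n'
    'Fairbanks (University of Alaska Fairbanks)[2]\n'  # hypothetical line
)

pairs = []
state = None
for line in sample:
    line = line.rstrip('\n')      # drop the trailing newline up front
    if '[edit]' in line:
        state = line              # remember the most recent state
    else:
        pairs.append((state, line))

print(pairs)
```

The notebook keeps the newlines at this stage and removes them later during cleaning, which works just as well.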

In [124]:
# Import our list into a pandas dataframe with two columns
towns_df = pd.DataFrame(university_towns,
                         columns=['State', 'RegionName'])

In [125]:
# Take a peek at our pandas dataframe
towns_df.head()

Unnamed: 0,State,RegionName
0,Alabama[edit]\n,Auburn (Auburn University)[1]\n
1,Alabama[edit]\n,Florence (University of North Alabama)\n
2,Alabama[edit]\n,Jacksonville (Jacksonville State University)[2]\n
3,Alabama[edit]\n,Livingston (University of West Alabama)[2]\n
4,Alabama[edit]\n,Montevallo (University of Montevallo)[2]\n


We can now write a function that we can apply to every element in the dataframe. 

In [126]:
def get_citystate(item):
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

What does this function do?

 1.   if ' (' in item: This condition checks whether the string contains the substring ' ('. If it does, there is additional information enclosed in parentheses (in our data, the university names). If this condition is true, it executes the following:
        return item[:item.find(' (')]: This extracts the substring of item from the beginning of the string (item[:) up to the index where the substring ' (' starts (item.find(' (')). It essentially <b> removes the parenthesized information, and anything after it, and returns the remaining part of the string </b>

2.    elif '[' in item: If the first condition is false, this condition checks whether the string contains a '[' character. If it does, there is additional information enclosed in square brackets (such as the [edit] marker after a state name). If this condition is true, it executes the following:
        return item[:item.find('[')]: Similar to the previous case, it extracts the substring from the beginning of the string up to the index where the '[' character starts. <b> It removes the bracketed information, and anything after it, and returns the remaining part of the string. </b>

3.    else: If neither condition is true, there are no parentheses or square brackets in the string. In this case, it simply returns the original string unchanged.

In summary, this function cleans up a string that might contain additional details in parentheses or square brackets, leaving just the city or state name. It returns the cleaned version of the string, or the original string if there is nothing to remove.
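
To make the three branches concrete, here is the function called on one string per branch. The first two inputs come from the list built above; `'Somewhere\n'` is a made-up value for the fall-through case.

```python
def get_citystate(item):
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

town = get_citystate('Auburn (Auburn University)[1]\n')   # first branch
state = get_citystate('Alabama[edit]\n')                  # second branch
plain = get_citystate('Somewhere\n')                      # third branch

print(repr(town), repr(state), repr(plain))
```

The first two calls return `'Auburn'` and `'Alabama'`. The third returns the string untouched, trailing newline included, so values without parentheses or brackets would still need a separate strip if that matters.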

We can now apply this function to the entire dataframe using map().

In [129]:
towns_df =  towns_df.map(get_citystate)

In [130]:
towns_df.head()

Unnamed: 0,State,RegionName
0,Alabama,Auburn
1,Alabama,Florence
2,Alabama,Jacksonville
3,Alabama,Livingston
4,Alabama,Montevallo


You may ask: Why wouldn't I always take this approach instead of using NumPy? <br> The answer is <b>cost</b>. <br> map() calls a Python function once for every element, so if you have millions or even billions of rows of data, applying it to the entire dataframe can be very expensive in time and resources. It may be better to clean single columns at a time, especially with vectorized NumPy operations, which are generally more efficient than element-wise map() calls. 
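
As one possible column-at-a-time alternative, the sketch below uses pandas str methods with a regular expression instead of map(). It is an approximation of get_citystate rather than an exact equivalent (it cuts at the first '(' or '[' and then trims whitespace), and the two-row `towns` frame is a small sample. Whether it is actually faster depends on the data and the pandas version, so it is worth timing on your own dataset.

```python
import pandas as pd

towns = pd.DataFrame({
    'State': ['Alabama[edit]\n', 'Alabama[edit]\n'],
    'RegionName': ['Auburn (Auburn University)[1]\n',
                   'Florence (University of North Alabama)\n'],
})

# Capture everything before the first '(' or '[', then trim whitespace
pattern = r'^([^(\[]+)'
for col in ['State', 'RegionName']:
    towns[col] = towns[col].str.extract(pattern, expand=False).str.strip()

print(towns)
```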

## Renaming Columns and Skipping Rows

Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of the terms in the dataset, or footnotes.

In that case, we’d want to rename columns and skip certain rows so that we can drill down to necessary information with correct and sensible labels.

To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of the “olympics.csv” dataset:

$ head -n 5 Datasets/olympics.csv <br>
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 <br>
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total <br>
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2 <br>
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15 <br>
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70 <br>


Next we can read it into a dataframe and take a peek that way.

In [135]:
olympics_df = pd.read_csv('olympics.csv')

In [136]:
olympics_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
1,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
2,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
3,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
4,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12


The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.

Also, if we were to go to the [source](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table) of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”, and so on.

Therefore, we need to do two things:

    - Skip one row and set the header as the first (0-indexed) row
    - Rename the columns


We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function. We can set the header to be the next line by setting header=1 in the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) call. 

In [137]:
olympics_df = pd.read_csv('olympics.csv', header=1)

In [138]:
olympics_df.head()

Unnamed: 0.1,Unnamed: 0,? Summer,01 !,02 !,03 !,Total,? Winter,01 !.1,02 !.1,03 !.1,Total.1,? Games,01 !.2,02 !.2,03 !.2,Combined total
0,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
1,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
2,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
3,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
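
The effect of header=1 can be reproduced without the file by feeding the first few lines shown earlier through `io.StringIO`. The sketch also tries `skiprows=1`, which is another way to discard the first line; `with_header` and `with_skip` are names chosen for this example.

```python
import io
import pandas as pd

raw = (
    '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n'
    ',? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,'
    '? Games,01 !,02 !,03 !,Combined total\n'
    'Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2\n'
    'Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15\n'
)

with_header = pd.read_csv(io.StringIO(raw), header=1)   # line 1 is the header
with_skip = pd.read_csv(io.StringIO(raw), skiprows=1)   # drop line 0 first

print(list(with_header.columns)[:3])
print(with_header.equals(with_skip))
```

Both calls produce the same frame. The empty first header cell becomes `Unnamed: 0`, and repeated names such as `01 !` get `.1` and `.2` suffixes so that every column label stays unique.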


We can rename the columns to something more sensible using pandas' [rename()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html), which takes a mapping (a dictionary of old name: new name pairs) and renames the columns accordingly. 

In [139]:
new_names =  {'Unnamed: 0': 'Country',
              '? Summer': 'Summer Olympics',
              '01 !': 'Gold',
              '02 !': 'Silver',
              '03 !': 'Bronze',
              '? Winter': 'Winter Olympics',
              '01 !.1': 'Gold.1',
              '02 !.1': 'Silver.1',
              '03 !.1': 'Bronze.1',
              '? Games': '# Games',
              '01 !.2': 'Gold.2',
              '02 !.2': 'Silver.2',
              '03 !.2': 'Bronze.2'}

In [142]:
# Notice in this case that I want to change the underlying dataframe, as the original headers are junk, so I use inplace=True
olympics_df.rename(columns=new_names, inplace=True)

In [141]:
olympics_df.head()

Unnamed: 0,Country,Summer Olympics,Gold,Silver,Bronze,Total,Winter Olympics,Gold.1,Silver.1,Bronze.1,Total.1,# Games,Gold.2,Silver.2,Bronze.2,Combined total
0,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
1,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
2,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
3,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


Great. We have gotten our dataframe to use the second row as the header, and cleaned up the column names using rename()!
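
Two details of rename() are worth seeing on a small example: without inplace=True it returns a new frame, and by default it silently skips mapping keys that do not match any column. The `medals` frame and the `'No such column'` key below are made up for illustration.

```python
import pandas as pd

medals = pd.DataFrame({'Unnamed: 0': ['Afghanistan (AFG)'],
                       '01 !': [0],
                       'Total': [2]})

names = {'Unnamed: 0': 'Country',
         '01 !': 'Gold',
         'No such column': 'Ignored'}   # hypothetical key, not in the frame

renamed = medals.rename(columns=names)

print(list(renamed.columns))   # ['Country', 'Gold', 'Total']
print(list(medals.columns))    # the original labels are unchanged
```

Passing `errors='raise'` to rename() turns an unmatched key into a `KeyError`, which can help catch typos in a long mapping like the one used above.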