<a href="https://colab.research.google.com/github/aliawofford9317/LSAMP_Python_Course2024/blob/Lucilla/Copy_of_Lesson_4a_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Quick Pandas review
In this section we will go over a Pandas example to refresh some key concepts.

This dataset is read from a web page with the `read_csv` function.

The dataset records sepal and petal measurements for three species of iris (a type of plant): Setosa, Versicolour, and Virginica. It is loaded here as a pandas DataFrame with 150 rows, one per sample.

The columns are sepal length, sepal width, petal length, petal width, and the species name.
For more information on the dataset, follow [this link](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

In [16]:
def fetch_data():
  import os, shutil
  cwd = os.getcwd()
  if os.path.exists("LSAMP_Python_Course2024"):
    shutil.rmtree("LSAMP_Python_Course2024")
  !git clone https://github.com/aliawofford9317/LSAMP_Python_Course2024.git
  for file in os.listdir("LSAMP_Python_Course2024"):
    if file.endswith((".txt",".csv")):
      shutil.copy("LSAMP_Python_Course2024/{}".format(file),cwd)
fetch_data()

Cloning into 'LSAMP_Python_Course2024'...
remote: Enumerating objects: 243, done.
remote: Counting objects: 100% (173/173), done.
remote: Compressing objects: 100% (105/105), done.
remote: Total 243 (delta 97), reused 122 (delta 65), pack-reused 70
Receiving objects: 100% (243/243), 2.59 MiB | 7.33 MiB/s, done.
Resolving deltas: 100% (133/133), done.


In [17]:
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


*Max and min values*

In [18]:
iris.max()

sepal_length          7.9
sepal_width           4.4
petal_length          6.9
petal_width           2.5
species         virginica
dtype: object

In [19]:
type(iris['sepal_length'].max())

numpy.float64

In [20]:
iris['sepal_length'].min()

4.3

In [21]:
# Look for duplicate values
iris['sepal_length'].duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: sepal_length, Length: 150, dtype: bool
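`duplicated()` marks a value as `True` only when the same value has already appeared earlier in the column, which is why the first rows are `False` and later rows are often `True`. A minimal sketch on made-up measurements (the numbers are hypothetical, not taken from the iris data):

```python
import pandas as pd

# Hypothetical sepal lengths with two repeated values
lengths = pd.Series([5.1, 4.9, 5.1, 4.7, 4.9])

# True marks a value that already appeared earlier in the Series
is_repeat = lengths.duplicated()

# Summing the Boolean mask counts the repeats
n_repeats = int(is_repeat.sum())

# drop_duplicates keeps the first occurrence of each value
unique_lengths = lengths.drop_duplicates()
```

Note that repeated measurements in a single column do not by themselves mean the rows are duplicates; `DataFrame.duplicated()` compares whole rows instead.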

We can group our data by a column we specify. Let's group by species.

In [22]:
iris_group = iris.groupby(by='species')

In [23]:
iris_group.mean()

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


Or calculate the mean of a specific column

In [24]:
iris_group['petal_width'].mean()

species
setosa        0.246
versicolor    1.326
virginica     2.026
Name: petal_width, dtype: float64

Show the rows whose petal width is between 1.3 and 1.5, inclusive.

In [25]:
iris[(iris['petal_width'] >= 1.3) & (iris['petal_width'] <= 1.5)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
55,5.7,2.8,4.5,1.3,versicolor
58,6.6,2.9,4.6,1.3,versicolor
59,5.2,2.7,3.9,1.4,versicolor
61,5.9,3.0,4.2,1.5,versicolor
63,6.1,2.9,4.7,1.4,versicolor
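The same kind of range filter can also be written with `Series.between`, which includes both endpoints by default. A small sketch on a toy DataFrame (the values are hypothetical and chosen only to exercise the boundaries):

```python
import pandas as pd

# Hypothetical petal widths around the 1.3 to 1.5 range
toy = pd.DataFrame({'petal_width': [0.2, 1.3, 1.4, 1.5, 1.6, 2.5]})

# Explicit comparisons combined with &, as in the cell above
mask_explicit = (toy['petal_width'] >= 1.3) & (toy['petal_width'] <= 1.5)

# Equivalent mask; both endpoints are included by default
mask_between = toy['petal_width'].between(1.3, 1.5)

selected = toy[mask_between]
```

Either form produces a Boolean mask; `between` is mostly a readability choice.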


## Data cleaning
Data cleaning is the process of finding and fixing or removing null, inconsistent, or otherwise dirty data.
Cleaning a dataset can take a lot of time, so it's really important to know how to do it properly. We will cover the following:
- Dropping unnecessary columns in a DataFrame
- Changing the index of a DataFrame
- Using `.str` methods to clean columns
- Using the `DataFrame.applymap()` function to clean the entire dataset, element-wise
- Renaming columns to a more recognizable set of labels
- Skipping unnecessary rows in a CSV file

We will use 3 datasets for this example:
- `BL-Flickr-Images-Book.csv` – A CSV file containing information about books from the British Library
- `university_towns.txt` – A text file containing names of college towns in every US state
- `olympics.csv` – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

In [26]:
# Let's import the required modules
import pandas as pd
import numpy as np

Often we will find columns that are not useful for what we are trying to achieve. We can remove them with the `drop` method by passing it a list of the column names we want to delete.

First, let's open our data.

In [27]:
df = pd.read_csv('BL-Flickr-Images-Book.csv')
df.head()

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.


You can see that a lot of these columns don't provide the information on the books we are looking for. Let's drop the following columns: `Edition Statement`, `Corporate Author`, `Corporate Contributors`, `Former owner`, `Engraver`, `Contributors`, `Issuance type`, and `Shelfmarks`.

We will pass a list of the column names we want to get rid of to the `drop` method. We will also use the `inplace=True` parameter, so the DataFrame is modified directly instead of a new one being returned, and `axis=1`, so the drop is applied to columns and not rows (`axis=0`).

When we inspect our dataset afterwards, we see that the unwanted columns have been removed.

In [28]:
to_drop = ['Edition Statement',
            'Corporate Author',
            'Corporate Contributors',
            'Former owner',
            'Engraver',
            'Contributors',
            'Issuance type',
            'Shelfmarks']

df.drop(to_drop, inplace=True, axis=1)

In [29]:
df.head()

Unnamed: 0,Identifier,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
0,206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
1,216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
2,218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
3,472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
4,480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


We could also use an alternative syntax to drop our columns.

In [31]:
# Alternative syntax for the same drop. Because the columns were already
# removed by the previous cell, running this again raises a KeyError.
df.drop(columns=to_drop, inplace=True)

KeyError: "['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks'] not found in axis"
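If a notebook cell may be run more than once, one way to avoid this `KeyError` is to return a new DataFrame and pass `errors='ignore'`, which skips labels that are not present. A minimal sketch on a toy DataFrame (the `books` frame and its contents are illustrative, not the British Library data):

```python
import pandas as pd

# Toy frame standing in for the book data
books = pd.DataFrame({
    'Identifier': [206, 216],
    'Engraver': [None, None],
    'Title': ['Walter Forbes', 'All for Greed'],
})

# 'Shelfmarks' is deliberately absent from this toy frame
to_drop = ['Engraver', 'Shelfmarks']

# errors='ignore' drops the labels that exist and skips the rest
trimmed = books.drop(columns=to_drop, errors='ignore')

# Running the same drop again is harmless
rerun = trimmed.drop(columns=to_drop, errors='ignore')
```

The trade-off is that a misspelled column name is also skipped silently, so the default behavior is the safer choice when the column list should match exactly.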

### Changing the index
Remember that we can use any column of a DataFrame as its index. This is versatile, but it does not guarantee unique index values, which would be a problem if, for example, two books had the same title.

First, let's check whether the `Identifier` column is unique with the `.is_unique` attribute, which returns `True` or `False` for the column it is called on.

In [32]:
df['Identifier'].is_unique

True

Let's set our index to the `Identifier` column.

In [None]:
df = df.set_index('Identifier')
df.head()

Unnamed: 0_level_0,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
Identifier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
206,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


We can access each record with `loc[]` and an index label. Because `loc` is label-based, `480` here refers to the `Identifier` value, not to the row position.

In [None]:
df.loc[480]

Place of Publication                                               London
Date of Publication                                                  1857
Publisher                                            Wertheim & Macintosh
Title                   [The World in which I live, and my place in it...
Author                                                          A., E. S.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 480, dtype: object
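The difference between label-based and position-based access is easy to see on a small frame. The sketch below uses a three-row toy DataFrame (its contents are copied from the rows shown earlier, but the frame itself is only illustrative):

```python
import pandas as pd

books = pd.DataFrame({
    'Identifier': [206, 216, 480],
    'Publisher': ['S. Tinsley & Co.', 'Virtue & Co.', 'Wertheim & Macintosh'],
}).set_index('Identifier')

# loc looks up the index label 480
by_label = books.loc[480]

# iloc looks up the row position; the third row is position 2
by_position = books.iloc[2]
```

Here both lookups return the same row, but `books.iloc[480]` would fail because the frame has only three rows.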

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier.

We could also use the `set_index()` method with the `inplace=True` parameter so our changes affect the dataframe directly and we don't have to create a new copy.

In [None]:
# Running this will result in an error...
# ...if index is already set to Identifier
df.set_index('Identifier', inplace=True)

KeyError: "None of ['Identifier'] are in the columns"

### Cleaning Datafields
In this section we will give our data a more consistent format. We will be changing the `Date of Publication` and `Place of Publication` columns.

If we take a look at `Date of Publication`, we can see that some entries contain extra dates in square brackets, date ranges, or multiple publication dates. We will need to:
- Remove the extra dates in square brackets, as in `1879 [1878]`.
- Convert date ranges to their start date, whenever present.
- Remove dates we are not certain about.

In [None]:
df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

Synthesizing these patterns, we can take advantage of a single regular expression to extract the publication year: `r'^(\d{4})'`.

This regular expression finds any four digits at the beginning of a string, which suffices for our case. It is written as a raw string (meaning that a backslash is not treated as an escape character), which is standard practice with regular expressions.

The `\d` represents any digit, and `{4}` repeats this rule four times. The `^` character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want `^` to avoid cases where `[` starts off the string.)

In [None]:
# Running our regex with the .str method
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head(20)

Identifier
206     1879
216     1868
218     1869
472     1851
480     1857
481     1875
519     1872
667      NaN
874     1676
1143    1679
1280    1802
1808    1859
1905    1888
1929    1839
2836    1897
2854    1865
2956    1860
2957    1873
3017    1866
3131    1899
Name: Date of Publication, dtype: object
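To see how the pattern treats each kind of entry, here is a small sketch on a toy Series. The first three strings appear in the outputs above; `'[1897?]'` is a hypothetical entry added only to illustrate the case where `[` starts the string:

```python
import pandas as pd

dates = pd.Series(['1879 [1878]', '1839, 38-54', '1860-63', '[1897?]'])

# Capture four digits only when they sit at the very start of the string
years = dates.str.extract(r'^(\d{4})', expand=False)
```

The first three entries yield `'1879'`, `'1839'`, and `'1860'`, while the bracketed entry does not match and becomes a missing value.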

This column still has object dtype, but we can easily get its numerical version with `pd.to_numeric`. Because the missing entries are stored as `NaN`, the result is a float column:

In [None]:
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype

dtype('float64')

This leaves roughly 12% of the values missing, which is a small price to pay for now being able to do computations on the remaining valid values:

In [None]:
df['Date of Publication'].isnull().sum() / len(df)

0.11717147339205986
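The float dtype is a side effect of the missing values, since `NaN` is itself a float. If whole-number years are preferred, one option (not used in this lesson) is pandas' nullable `Int64` dtype. A minimal sketch on a toy Series:

```python
import pandas as pd

# Two extracted years and one missing value, stored as objects
extracted = pd.Series(['1879', '1868', None], dtype=object)

# Missing values force a float result
as_float = pd.to_numeric(extracted)

# The nullable integer dtype keeps whole numbers alongside missing values
as_int = as_float.astype('Int64')

# isnull().mean() gives the same fraction as isnull().sum() / len(...)
missing_share = as_float.isnull().mean()
```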

### `str` Methods with NumPy to Clean Columns

You may have noticed the use of `df['Date of Publication'].str`. This attribute is a way to access speedy string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as `.split()`, `.replace()`, and `.capitalize()`.

To clean the `Place of Publication` field, we can combine Pandas `str` methods with NumPy's `np.where` function, which is basically a vectorized form of Excel's `IF()` function. It has the following syntax:

`np.where(condition, then, else)`

Essentially, `np.where()` takes each element in the object used for `condition`, checks whether that particular element evaluates to `True`, and returns an ndarray containing `then` or `else`, depending on which applies.

It can be nested into a compound if-then statement, allowing us to compute values based on multiple conditions:

` np.where(condition1, x1,
        np.where(condition2, x2,
            np.where(condition3, x3, ...)))`
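As a concrete instance of the nested form, the sketch below sorts a few made-up totals into three bands. The thresholds and labels are arbitrary choices for illustration:

```python
import numpy as np

totals = np.array([2, 15, 70, 250])

# The outer condition is checked first; the inner np.where handles the rest
labels = np.where(totals >= 100, 'high',
                  np.where(totals >= 10, 'medium', 'low'))
```

Each element is assigned the value for the first condition it satisfies, reading from the outermost call inwards.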

In [None]:
# Check our place of publication column
df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case for only some rows that have their place of publication as ‘London’ or ‘Oxford’.

In [None]:
df.loc[4157862]

Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                1867.0
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                      FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

In [None]:
df.loc[4159587]

Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                1834.0
Publisher                                                Mackenzie & Dent
Title                   An historical, topographical and descriptive v...
Author                                              Mackenzie, E. (Eneas)
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not.

To clean this column in one sweep, we can use str.contains() to get a Boolean mask.

We clean the column as follows:

In [None]:
pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool

In [None]:
# Combine it with np.where
oxford = pub.str.contains('Oxford')

In [None]:
df['Place of Publication'] = np.where(london, 'London',
                                      np.where(oxford, 'Oxford',
                                               pub.str.replace('-', ' ')))

In [None]:
df['Place of Publication'].head()

Identifier
206    London
216    London
218    London
472    London
480    London
Name: Place of Publication, dtype: object

Here, the `np.where` function is called in a nested structure, with `condition` being a Series of Booleans obtained with `str.contains()`. The `contains()` method works similarly to the built-in `in` keyword used to find the occurrence of an entity in an iterable (or substring in a string), although by default it interprets its argument as a regular expression.

The replacement to be used is a string representing our desired place of publication. For all remaining rows, we replace hyphens with a space using `str.replace()`, and we reassign the result to the column in our DataFrame.
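One detail this lesson does not check is whether the column contains missing values. Depending on the pandas version and dtype, `str.contains()` can leave those entries missing in the mask, and `np.where` would then not treat them as a clean `False`. A hedged sketch of a more defensive variant on a toy Series (the `None` entry is hypothetical), using the `na=False` and `regex=False` parameters:

```python
import numpy as np
import pandas as pd

pub = pd.Series(['London; Virtue & Yorston',
                 'pp. 40. G. Bryan & Co: Oxford, 1898',
                 'Newcastle-upon-Tyne',
                 None], dtype=object)

# na=False makes missing entries count as "no match";
# regex=False treats the pattern as a plain substring
london = pub.str.contains('London', regex=False, na=False)
oxford = pub.str.contains('Oxford', regex=False, na=False)

cleaned = np.where(london, 'London',
                   np.where(oxford, 'Oxford',
                            pub.str.replace('-', ' ', regex=False)))
```

With these masks the missing entry stays missing instead of being relabeled as one of the cities.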

Although there is more dirty data in this dataset, we will discuss only these two columns for now.

Let’s have a look at the first five entries, which look a lot crisper than when we started out:

In [None]:
df.head()

Unnamed: 0_level_0,Place of Publication,Date of Publication,Publisher,Title,Author,Flickr URL
Identifier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
206,London,1879.0,S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,http://www.flickr.com/photos/britishlibrary/ta...
216,London,1868.0,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
218,London,1869.0,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.",http://www.flickr.com/photos/britishlibrary/ta...
472,London,1851.0,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...
480,London,1857.0,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.",http://www.flickr.com/photos/britishlibrary/ta...


### Using the `applymap` Function
In certain situations, you will see that the dirty data is not localized to one column but is more spread out.

There are some instances where it would be helpful to apply a customized function to each cell or element of a DataFrame. The Pandas `.applymap()` method is similar to the built-in `map()` function and simply applies a function to all the elements in a DataFrame. (In pandas 2.1 and later, `applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.)

Let’s look at an example. We will create a DataFrame out of the `university_towns.txt` file.

In this file, state names appear periodically, each followed by the university towns in that state: `StateA TownA1 TownA2 StateB TownB1 TownB2...`. If we look at the way state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

In [33]:
university_towns = []
with open('university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

In [34]:
university_towns[:5]

[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set State to the left value and RegionName to the right value.

The resulting DataFrame looks like this:

In [35]:
towns_df = pd.DataFrame(university_towns,
                         columns=['State', 'RegionName'])

In [36]:
towns_df.head()

Unnamed: 0,State,RegionName
0,Alabama[edit]\n,Auburn (Auburn University)[1]\n
1,Alabama[edit]\n,Florence (University of North Alabama)\n
2,Alabama[edit]\n,Jacksonville (Jacksonville State University)[2]\n
3,Alabama[edit]\n,Livingston (University of West Alabama)[2]\n
4,Alabama[edit]\n,Montevallo (University of Montevallo)[2]\n


While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. We could use Pandas’ `.str` methods again here, but we could also use `applymap()` to map a Python callable to each element of the DataFrame.

We have been using the term element, but what exactly do we mean by it? Consider the following “toy” DataFrame:

|   | 0 | 1 |
|---|---|---|
| 0 | Mock | Dataset |
| 1 | Python | Pandas |
| 2 | Real | Python |
| 3 | NumPy | Clean |

In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. Therefore, applymap() will apply a function to each of these independently. Let’s define that function:

In [37]:
def get_citystate(item):
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

In its simplest form, Pandas’ `.applymap()` takes a single argument, which is the function (callable) that should be applied to each element:

In [38]:
towns_df = towns_df.applymap(get_citystate)

First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether there’s a ( or [ in the element or not.

Depending on the check, values are returned accordingly by the function. Finally, the applymap() function is called on our object. Now the DataFrame is much neater:



In [39]:
towns_df.head(20)

Unnamed: 0,State,RegionName
0,Alabama,Auburn
1,Alabama,Florence
2,Alabama,Jacksonville
3,Alabama,Livingston
4,Alabama,Montevallo
5,Alabama,Troy
6,Alabama,Tuscaloosa
7,Alabama,Tuskegee
8,Alaska,Fairbanks
9,Arizona,Flagstaff


The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value.
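For readers on a newer pandas, the sketch below repeats the element-wise step on a two-row toy DataFrame and picks `DataFrame.map` when it exists, falling back to `applymap` otherwise. The `elementwise` name and the `hasattr` check are illustrative choices, not part of the lesson's code:

```python
import pandas as pd

def get_citystate(item):
    # Cut the string at ' (' if present, otherwise at '['
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

toy = pd.DataFrame({
    'State': ['Alabama[edit]\n', 'Alabama[edit]\n'],
    'RegionName': ['Auburn (Auburn University)[1]\n',
                   'Florence (University of North Alabama)\n'],
})

# DataFrame.map exists from pandas 2.1; older versions only have applymap
elementwise = toy.map if hasattr(pd.DataFrame, 'map') else toy.applymap
clean = elementwise(get_citystate)
```

Note that a string containing neither ` (` nor `[` is returned unchanged, so a trailing newline would survive in that case; chaining `.strip()` inside the function would be one way to handle it.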

### Renaming Columns and Skipping Rows
Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of the terms in the dataset, or footnotes.

In that case, we’d want to rename columns and skip certain rows so that we can drill down to necessary information with correct and sensible labels.

To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of the “olympics.csv” dataset:

In [40]:
olympics_df = pd.read_csv('olympics.csv')
olympics_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
1,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
2,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
3,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
4,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12


This is messy indeed! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.

Also, if we were to go to the [source](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table) of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”, and so on.

Therefore, we need to do two things:

- Skip one row and set the header as the first (0-indexed) row
- Rename the columns

We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function.

This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:

In [41]:
olympics_df = pd.read_csv('olympics.csv', header=1)
olympics_df.head()

Unnamed: 0.1,Unnamed: 0,? Summer,01 !,02 !,03 !,Total,? Winter,01 !.1,02 !.1,03 !.1,Total.1,? Games,01 !.2,02 !.2,03 !.2,Combined total
0,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
1,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
2,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
3,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12


We now have the correct row set as the header and all unnecessary rows removed. Take note of how Pandas has changed the name of the column containing the name of the countries from NaN to Unnamed: 0.
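The effect of `header=1`, including the automatic `Unnamed: 0` and `.1` names, can be reproduced without the file. The sketch below reads a three-line CSV string through `io.StringIO`; the text is a cut-down imitation of the layout above, not the real file:

```python
import io
import pandas as pd

raw = (
    "0,1,2,3\n"
    ",? Summer,01 !,01 !\n"
    "Afghanistan (AFG),13,0,0\n"
)

# Default: the first line (0,1,2,3) becomes the header
default = pd.read_csv(io.StringIO(raw))

# header=1: the second line becomes the header and the first is discarded
fixed = pd.read_csv(io.StringIO(raw), header=1)
```

The blank first header cell is named `Unnamed: 0`, and the repeated `01 !` label gets a `.1` suffix so that column names stay unique.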

To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict).

Let’s start by defining a dictionary that maps current column names (as keys) to more usable ones (the dictionary’s values):

In [42]:
new_names =  {'Unnamed: 0': 'Country',
               '? Summer': 'Summer Olympics',
               '01 !': 'Gold',
               '02 !': 'Silver',
               '03 !': 'Bronze',
               '? Winter': 'Winter Olympics',
               '01 !.1': 'Gold.1',
               '02 !.1': 'Silver.1',
               '03 !.1': 'Bronze.1',
               '? Games': '# Games',
               '01 !.2': 'Gold.2',
               '02 !.2': 'Silver.2',
               '03 !.2': 'Bronze.2'}

In [43]:
# Call the rename function
olympics_df.rename(columns=new_names, inplace=True)
olympics_df.head()

Unnamed: 0,Country,Summer Olympics,Gold,Silver,Bronze,Total,Winter Olympics,Gold.1,Silver.1,Bronze.1,Total.1,# Games,Gold.2,Silver.2,Bronze.2,Combined total
0,Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
1,Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
2,Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
3,Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
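By default, `rename()` silently skips mapping keys that do not match any column, so a typo in the dictionary goes unnoticed. A small sketch of how `errors='raise'` surfaces that case, using a toy frame in which the `'03 !'` column is deliberately absent:

```python
import pandas as pd

medals = pd.DataFrame({'01 !': [0, 5], '02 !': [0, 2]})
mapping = {'01 !': 'Gold', '02 !': 'Silver', '03 !': 'Bronze'}

# Default behavior: the unmatched key '03 !' is ignored
renamed = medals.rename(columns=mapping)

# errors='raise' reports keys that match no column
try:
    medals.rename(columns=mapping, errors='raise')
    missing = ''
except KeyError as exc:
    missing = str(exc)
```

Without `inplace=True`, `rename()` returns a new DataFrame and leaves `medals` unchanged.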


#### Practice Exercise
1. Using the `olympics` dataset, check if the `Country` column is unique. If it is unique set the column as the new index.
2. Check that the dataset does not contain null values.
3. Sort the countries by `Combined total` medals in descending order. Use `inplace=True`. You can use the `sort_values()` method.
4. Apply the `medal_count` function to the dataframe with `map()` over the `Combined total` column. The function is given in the following code block.

In [54]:
def medal_count(item):
    if item >= 100:
        return 'Lots of medals'
    if item < 100:
        return 'Few medals'

In [46]:
# Reload with the header row used earlier and reuse the `new_names` mapping,
# so that the `Country` and `Combined total` columns exist
olympics_df = pd.read_csv('olympics.csv', header=1)
olympics_df.rename(columns=new_names, inplace=True)

# 1. Set `Country` as the index only if its values are unique
if olympics_df['Country'].is_unique:
    olympics_df.set_index('Country', inplace=True)
else:
    print("'Country' column is not unique.")

In [49]:
# 2. Check for null values anywhere in the DataFrame
if olympics_df.isnull().sum().sum() == 0:
    print("The dataset does not contain null values.")
else:
    print("The dataset contains null values.")

In [51]:
# 3. Sort by total medals, largest first
olympics_df.sort_values(by='Combined total', ascending=False, inplace=True)

In [58]:
# 4. Label each country using the `medal_count` function given above
olympics_df['Medal Category'] = olympics_df['Combined total'].map(medal_count)
olympics_df.head()
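The four exercise steps can also be rehearsed on a small made-up table before running them on the real file. In the sketch below the country labels and totals are hypothetical, and `medal_count` is a condensed version of the function given in the exercise:

```python
import pandas as pd

def medal_count(item):
    if item >= 100:
        return 'Lots of medals'
    return 'Few medals'

toy = pd.DataFrame({
    'Country': ['A (AAA)', 'B (BBB)', 'C (CCC)'],
    'Combined total': [15, 250, 100],
})

# 1. Use Country as the index only if it is unique
if toy['Country'].is_unique:
    toy.set_index('Country', inplace=True)

# 2. Check for null values
has_nulls = bool(toy.isnull().values.any())

# 3. Sort by total medals, largest first
toy.sort_values(by='Combined total', ascending=False, inplace=True)

# 4. Map the labeling function over the totals
toy['Medal Category'] = toy['Combined total'].map(medal_count)
```

Running the same steps on `olympics_df` requires the `Country` and `Combined total` columns to exist, which depends on loading the file with `header=1` and applying the rename shown earlier.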