<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Data Cleaning in Python

The data we have collected may have several issues we need to identify:

* Are there missing values? How are they represented?
* Is the table in a format ready for analysis? Are there irrelevant elements that could distract us or confuse our work?
* Is every cell well written? Are there characters that could prevent later analysis?
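
To make the first question concrete, here is a minimal sketch on made-up data. The column names and the placeholder symbols `?` and `-` are assumptions for illustration, not values from the table used below. It shows that such placeholders are read as ordinary strings unless they are declared as missing:

```python
import io
import pandas as pd

# Hypothetical CSV text: missing values are written as '?' and '-'
raw = "country,score\nA,7.5\nB,?\nC,-\n"

# Default reading: the placeholders are kept as strings
asRead = pd.read_csv(io.StringIO(raw))
print(asRead.score.isna().sum())

# Declaring the placeholders turns them into proper missing values
declared = pd.read_csv(io.StringIO(raw), na_values=['?', '-'])
print(declared.score.isna().sum())
```

When the placeholders are declared, the column can also be read as numeric instead of text.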

Let's check some data:

In [None]:
import pandas as pd

wikiLink="https://en.wikipedia.org/wiki/List_of_freedom_indices" 
freedomDFs=pd.read_html(wikiLink, flavor='bs4',attrs={'class':'wikitable sortable'})
len(freedomDFs)

Let's keep the first one:

In [None]:
freedom=freedomDFs[0].copy()
freedom.head()

## 1. Clean headers

In [None]:
# check headers
freedom.columns

Cleaning requires a strategy. In the column names above, the main problems are the footnotes and the quasi-duplicates.

In [None]:
# the quasi-duplicates: columns whose names contain 'Scor'
ScoreColumns=freedom.columns[freedom.columns.str.contains('Scor')]
ScoreColumns

In [None]:
# the remaining columns (not quasi-duplicates)
freedom.columns[~freedom.columns.str.contains('Scor')]

In [None]:
# save the non-score columns, skipping the first one:
notScoreColumns=freedom.columns[~freedom.columns.str.contains('Scor')][1:]
notScoreColumns

Let's keep the last ones without the footnotes. We can _divide and conquer_ using **split()**:

In [None]:
# using list comprehension
[element.split('[') for element in notScoreColumns]

You see how I split each element, but the resulting list of lists is not what you want; you need to keep only the first element of each:

In [None]:
# keeping first element [0]
[element.split('[')[0] for element in notScoreColumns]
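
The same idea can be tried on a few made-up strings. This is only a sketch: the header names below are invented to mimic a name followed by a footnote, they are not the real column names.

```python
# Hypothetical header strings, for illustration only
headers = ['Index A 2020[1]', 'Index B[note 2]', 'Index C']

# split('[') always returns a list with at least one element,
# so taking [0] is safe even when there is no footnote
firstParts = [h.split('[')[0] for h in headers]
print(firstParts)
```

Notice that the year in the first string survives, which is one reason to look for a more flexible tool.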

This is not bad at all. However, a more flexible alternative is using **regular expressions**. Entire books cover this topic, so I will only share some patterns that may prove useful.

In this situation, I want to:

* Get rid of footnotes.
* Get rid of the years.

Let's see:

In [None]:
import re  # standard-library module for regular expressions

# \d+      : one or more consecutive digits (the years)
# \[\w+\]  : one or more word characters inside square brackets (the footnotes)
# |        : means 'or'
# .strip() : removes leading and trailing spaces

pattern=r'\d+|\[\w+\]'  # raw string, so backslashes need no doubling
nothing=''

# substitute the 'pattern' with 'nothing':
[re.sub(pattern,nothing,element).strip() for element in notScoreColumns]

In [None]:
#save result
notScoreColumnsCleaner=[re.sub(pattern,nothing,element).strip() for element in notScoreColumns]
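
Before trusting a pattern, it may help to test it on strings where you already know the answer. The sketch below uses invented headers (not the real column names) and also shows one limit of `\[\w+\]`: it does not match a footnote that contains a space. The wider pattern is just one possible alternative, not a requirement for this table.

```python
import re

# Hypothetical headers, for illustration only
headers = ['Index A 2020[1]', 'Index B [b]', 'Index C']

# the raw string r'\d+' is the same pattern as '\\d+'
pattern = r'\d+|\[\w+\]'
cleaned = [re.sub(pattern, '', h).strip() for h in headers]
print(cleaned)

# \w does not match spaces, so '[note 2]' is only partly removed
partly = re.sub(pattern, '', 'Index D[note 2]').strip()
print(partly)

# a wider pattern: anything between '[' and the next ']'
widerPattern = r'\d+|\[[^\]]*\]'
fully = re.sub(widerPattern, '', 'Index D[note 2]').strip()
print(fully)
```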

Let's create acronyms:

In [None]:
# split each name into a list of words
[nameCol.split() for nameCol in notScoreColumnsCleaner]

In [None]:
# first letter of each word, as a list
[[word[0] for word in nameCol.split()] for nameCol in notScoreColumnsCleaner]

In [None]:
# concatenate the first letters in each list
["".join([word[0] for word in nameCol.split()]) for nameCol in notScoreColumnsCleaner]

Let's save the acronyms:

In [None]:
acronyms=["".join([word[0] for word in nameCol.split()]) for nameCol in notScoreColumnsCleaner]
acronyms
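
If you expect to build acronyms more than once, the steps above could be wrapped in a small function. This is a sketch only: `make_acronym` is a hypothetical helper name and the sample names are invented.

```python
def make_acronym(name):
    """Join the first character of every whitespace-separated word."""
    return "".join(word[0] for word in name.split())

# Hypothetical names, for illustration only
names = ['Index of Something Good', 'World Report']
acronymsDemo = [make_acronym(n) for n in names]
print(acronymsDemo)
```

Because `split()` without arguments ignores repeated spaces, extra blanks between words do not break the result.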

We append "_score" to each acronym and store the results in another list:

In [None]:
acronyms_score=[acro+'_score' for acro in acronyms]
acronyms_score

Let's rename:

In [None]:
change={old:new for old,new in zip(ScoreColumns,acronyms_score)}
change

In [None]:
change2={old:new for old,new in zip(notScoreColumns,acronyms)}
change2

In [None]:
change.update(change2)
change

In [None]:
freedom.rename(columns=change,inplace=True)
freedom.head()
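
The renaming above relies on `zip` pairing old and new names by position, so both sequences must be in the same order. Here is a minimal sketch of that mechanism on a toy table; all column names are hypothetical.

```python
import pandas as pd

# Toy table with hypothetical column names
toy = pd.DataFrame({'Country': ['A'],
                    'Index A 2020[1]': ['free'],
                    'Score': [1.0]})

oldNames = ['Index A 2020[1]', 'Score']
newNames = ['IA', 'IA_score']

# zip pairs the names by position
mapping = dict(zip(oldNames, newNames))

# columns not present in the mapping keep their names
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))
```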

## 2. Clean the data values.

Since some columns hold categories, frequency tables can reveal misspelled or unexpected values:

In [None]:
freedom.FitW.value_counts(dropna=False).sort_index()

In [None]:
freedom.IoEF.value_counts(dropna=False).sort_index()

In [None]:
freedom.PFI.value_counts(dropna=False).sort_index()

In [None]:
freedom.DI.value_counts(dropna=False).sort_index()
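
To see what a frequency table looks like when the categories are not clean, here is a sketch with invented values (they are not taken from the table above). A spelling variant appears as its own category, and `dropna=False` makes the missing value visible:

```python
import numpy as np
import pandas as pd

# Hypothetical categories with a spelling variant and a missing value
status = pd.Series(['free', 'Free ', 'not free', np.nan, 'free'])

counts = status.value_counts(dropna=False)
print(counts)

# after normalizing case and spaces the variants collapse
normalizedCounts = status.str.strip().str.lower().value_counts()
print(normalizedCounts)
```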

The categories are well written.

Let's see the numeric columns. Let's identify the cells that do not hold numeric strings:


In [None]:
# values that are not digits with at most one decimal point
set(freedom.FitW_score[~freedom.FitW_score.str.contains(r'^\d+\.?\d*$')])

Then, we can generalize:

In [None]:
badValues=[]
for col in acronyms_score:
    # values in this column that do not look like a number
    currentBad=freedom[col][~freedom[col].str.contains(r'^\d+\.?\d*$')]
    badValues.extend(currentBad)

badValues=list(set(badValues))  # keep unique values only
badValues
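
The detection step can be checked on a small column whose contents you control. The values below are invented for illustration. The sketch uses `str.fullmatch`, which must match the whole string, so the anchors `^` and `$` are not needed; the pattern allows at most one decimal point.

```python
import pandas as pd

# Hypothetical score column stored as text
scores = pd.Series(['7.5', '10', 'n/a', '-', '3.', '1..2'])

# digits, an optional single dot, then optional digits
numericPattern = r'\d+\.?\d*'

isNumeric = scores.str.fullmatch(numericPattern)
bad = scores[~isNumeric].tolist()
print(bad)
```

Note that a pattern with `\.*` instead of `\.?` would accept `'1..2'`, which cannot be converted to a number.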

We will need to replace those values with a proper missing value:

In [None]:
import numpy as np

freedom.replace(to_replace=badValues, value=np.nan,inplace=True)

In [None]:
freedom.info()
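
One thing `info()` may reveal is that the score columns are still stored as text, since replacing the bad strings does not convert the remaining ones. A possible alternative, sketched here on invented values, is `pd.to_numeric` with `errors='coerce'`, which converts valid strings and turns everything else into a missing value in one step. This is a different technique from the replacement used above, and it hides which strings were bad, so inspecting them first is still worthwhile.

```python
import pandas as pd

# Hypothetical score column stored as text
scores = pd.Series(['7.5', '10', 'n/a', '-'])

# invalid strings become NaN, valid ones become floats
asNumbers = pd.to_numeric(scores, errors='coerce')
print(asNumbers.dtype)
print(asNumbers.isna().sum())
```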

Let's keep only the rows with complete data:

In [None]:
freedom.dropna(how='any',axis=0,inplace=True, # drop rows with any missing value
               ignore_index=True) # reset index (needs pandas 2.0 or later)

freedom
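
The `ignore_index` argument of `dropna` was added in pandas 2.0. On an older version, a similar result can be obtained by chaining `reset_index(drop=True)`, as in this sketch on a toy table with hypothetical values:

```python
import numpy as np
import pandas as pd

# Toy table: the second row has a missing score
toy = pd.DataFrame({'Country': ['A', 'B', 'C'],
                    'score': [1.0, np.nan, 3.0]})

# drop incomplete rows, then renumber the index from zero
complete = toy.dropna(how='any', axis=0).reset_index(drop=True)
print(complete)
```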

## 3. Check key column

In [None]:
freedom.Country

In [None]:
# to upper case and no trailing or leading spaces
freedom.Country.str.upper().str.strip()

In [None]:
freedom['Country']=freedom.Country.str.upper().str.strip()
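
A key column should identify each row, so after normalizing it may be worth checking that no value is repeated. The sketch below uses invented entries, not rows from the table, to show how names that differ only in case or spacing turn into duplicates once cleaned:

```python
import pandas as pd

# Hypothetical key values that differ only in case and spacing
names = pd.Series([' Peru', 'PERU', 'Chile '])

normalized = names.str.upper().str.strip()
print(normalized.tolist())

# True here means at least one key is repeated
print(normalized.duplicated().any())
```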

## 4. Save output

In [None]:
freedom.to_csv('freedom_Py.csv',index=False)
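
A quick way to confirm what was saved is to read the output back. The sketch below writes a toy table to an in-memory buffer instead of a file, so the table and its values are assumptions for illustration. With `index=False` the row index is not written as an extra column, and numeric-looking text is read back as numbers:

```python
import io
import pandas as pd

# Toy table: scores stored as text
toy = pd.DataFrame({'Country': ['A', 'B'], 'score': ['7.5', '10']})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# rewind and read back
buffer.seek(0)
back = pd.read_csv(buffer)
print(back.dtypes)
```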