## Task

In this task you will clean the country column and parse the date column in the **store_income_data_task.csv** file.

In [26]:
from datetime import datetime
import pandas as pd

# Load store_income_data_task.csv

df = pd.read_csv('store_income_data_task.csv')

1. Take a look at all the unique values in the "country" column. Then, convert the column to lowercase and remove any trailing white spaces.

In [27]:
# I have done this in reverse order because it makes more sense to me:
# cleaning the country values first and then checking the unique values keeps the replacement list shorter.
# I have not used the fuzzywuzzy library, as I found a different approach that was faster for me.

# Remove '.' and '/', collapse runs of whitespace into a single space, strip both ends and convert to lowercase
df["country"] = df["country"].str.replace(r"[./]", "", regex=True).str.replace(r"\s+", " ", regex=True).str.strip().str.lower()

# Printing results
print(df.country.unique())
#df

['united states' 'britain' 'united kingdom' 'uk' 'sa' 'america' nan
 'england' '' 'united states of america' 's africasouth africa']
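
To make each step of the cleaning chain easier to follow, here is a minimal sketch on a handful of made-up values (the `sample` Series is hypothetical and not taken from the CSV file). It also shows a side effect worth knowing about: removing `/` without inserting a space glues the neighbouring words together, which is probably how a value such as 's africasouth africa' ends up in the list above.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the cleaning chain
sample = pd.Series(["United States ", " U.K.", "S.A./South  Africa", None])

cleaned = (
    sample.str.replace(r"[./]", "", regex=True)  # drop dots and slashes
          .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
          .str.strip()                           # trim leading and trailing spaces
          .str.lower()                           # normalise the case
)

# Missing values pass through the string methods untouched
print(cleaned.tolist())
```

If the glued result is not wanted, replacing `/` with a space instead of an empty string would keep the two spellings apart, at the cost of a different entry in the replacement list.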


2. Note that there should only be three separate countries. Eliminate all variations, so that 'South Africa', 'United Kingdom' and 'United States' are the only three countries.

In [28]:
# Build the replacement mapping from the country list above
replacements = {
    'united states of america': 'United States',
    'america': 'United States',
    'united states': 'United States',
    'britain': 'United Kingdom',
    'united kingdom': 'United Kingdom',
    'england': 'United Kingdom',
    'uk': 'United Kingdom',
    'sa': 'South Africa',
    's africasouth africa': 'South Africa'
}

# Replace the country values, then fill NaN and empty-string values with 'Unknown'
df["country"] = df["country"].replace(replacements)
df['country'] = df['country'].replace('', 'Unknown').fillna('Unknown')
df

Unnamed: 0,id,store_name,store_email,department,income,date_measured,country
0,1,"Cullen/Frost Bankers, Inc.",,Clothing,$54438554.24,4-2-2006,United States
1,2,Nordson Corporation,,Tools,$41744177.01,4-1-2006,United Kingdom
2,3,"Stag Industrial, Inc.",,Beauty,$36152340.34,12-9-2003,United States
3,4,FIRST REPUBLIC BANK,ecanadine3@fc2.com,Automotive,$8928350.04,8-5-2006,United Kingdom
4,5,Mercantile Bank Corporation,,Baby,$33552742.32,21-1-1973,United Kingdom
...,...,...,...,...,...,...,...
995,996,Columbia Sportswear Company,cschooleyrn@sohu.com,Automotive,$52593924.99,7-10-2005,South Africa
996,997,WisdomTree Interest Rate Hedged High Yield Bon...,,Electronics,$60473676.46,19-12-1990,United States
997,998,Tortoise Energy Infrastructure Corporation,cbeardshallrp@ow.ly,Health,$1697293.64,25-4-2009,United States
998,999,Qwest Corporation,,Beauty,$30091863.73,13-1-2011,United Kingdom
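
A quick way to confirm that no stray variation survived the mapping is to compare the remaining values against the allowed set. The sketch below uses a small hypothetical Series and a shortened mapping, and treats 'Unknown' as an allowed placeholder; that placeholder is this notebook's choice and not part of the task statement. Note that `fillna` only touches missing values, so an empty string counts as a regular value and needs its own replacement.

```python
import pandas as pd

# Hypothetical cleaned values, including an empty string and a missing value
countries = pd.Series(["uk", "america", "sa", "", None, "england"])

# Shortened mapping for illustration only
replacements = {
    "uk": "United Kingdom",
    "england": "United Kingdom",
    "america": "United States",
    "sa": "South Africa",
}

mapped = countries.replace(replacements)
# Empty strings are not NaN, so they are handled separately from fillna
mapped = mapped.replace("", "Unknown").fillna("Unknown")

# Anything outside the allowed set points to a variation the mapping missed
allowed = {"South Africa", "United Kingdom", "United States", "Unknown"}
unexpected = set(mapped.unique()) - allowed
print(mapped.value_counts())
print("Unexpected values:", unexpected)
```

An empty `unexpected` set means the replacement list covered every spelling in the column.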


3. Create a new column called `days_ago` in the DataFrame, derived from the 'date_measured' column, that holds the number of days between the measurement date and the current date. Note that the current date can be obtained using `datetime.date.today()`.

In [29]:
# Parse date_measured into datetime values with pd.to_datetime(); the dates are day-first and unparseable values become NaT
df['date_measured'] = pd.to_datetime(df['date_measured'], dayfirst=True, errors='coerce')

# Calculate days ago
df['days_ago'] = (datetime.today() - df['date_measured']).dt.days
df



Unnamed: 0,id,store_name,store_email,department,income,date_measured,country,days_ago
0,1,"Cullen/Frost Bankers, Inc.",,Clothing,$54438554.24,2006-02-04,United States,6926
1,2,Nordson Corporation,,Tools,$41744177.01,2006-01-04,United Kingdom,6957
2,3,"Stag Industrial, Inc.",,Beauty,$36152340.34,2003-09-12,United States,7802
3,4,FIRST REPUBLIC BANK,ecanadine3@fc2.com,Automotive,$8928350.04,2006-05-08,United Kingdom,6833
4,5,Mercantile Bank Corporation,,Baby,$33552742.32,1973-01-21,United Kingdom,18993
...,...,...,...,...,...,...,...,...
995,996,Columbia Sportswear Company,cschooleyrn@sohu.com,Automotive,$52593924.99,2005-10-07,South Africa,7046
996,997,WisdomTree Interest Rate Hedged High Yield Bon...,,Electronics,$60473676.46,1990-12-19,United States,12452
997,998,Tortoise Energy Infrastructure Corporation,cbeardshallrp@ow.ly,Health,$1697293.64,2009-04-25,United States,5750
998,999,Qwest Corporation,,Beauty,$30091863.73,2011-01-13,United Kingdom,5122
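
Because `days_ago` depends on the day the notebook is run, the numbers above will differ on every run. The sketch below repeats the calculation against a fixed, hypothetical reference date so the result is reproducible; the sample dates and the explicit `format="%d-%m-%Y"` are assumptions based on how the dates appear in the table. As a side note, the `datetime.date.today()` hint in the task works after `import datetime`; with `from datetime import datetime`, as used in this notebook, the matching call is `datetime.today()`.

```python
import pandas as pd

# Hypothetical sample in the same day-month-year layout as the file
dates = pd.Series(["4-2-2006", "21-1-1973", "not a date"])

# An explicit format avoids day/month ambiguity; bad values become NaT
parsed = pd.to_datetime(dates, format="%d-%m-%Y", errors="coerce")

# Fixed reference date (an assumption) instead of the current date
reference = pd.Timestamp("2025-01-01")

# Subtracting gives timedeltas; .dt.days extracts whole days (NaN for NaT rows)
days_ago = (reference - parsed).dt.days
print(days_ago)
```

Rows that could not be parsed end up as NaN in `days_ago`, which also makes the column floating point, so it may be worth checking `parsed.isna().sum()` before relying on the result.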
