# Data Processing

## Imports

In [19]:
import pandas as pd

### Read the CSV file

In [20]:
data = pd.read_csv('../data/Test-data-analyst-soleadify.csv', delimiter=',')
data.head()

Unnamed: 0,Source 1 - full address,Source 1 - phone number,Source 1 - Region,Source 1 - City,Source 1 - Country,Source 3 - Website,Source 3 - Activity,Source 2 - Activity,Source 2 - Website,Source 2 - Country,...,Source 2 - Phone,Source 3 - Country,Source 3 - Region,Source 3 - City,Source 3 - Phone,Source 4 - Activity,Unnamed: 18,Source 5 - Activity,Unnamed: 20,Source 6 - Activity
0,32 BLUE SPRINGS RD NORTH YORK ON M6L2T3 CA,4168582406,on,north york,canada,safeelectricalsolutions.ca,Electrical & Wiring Contractors,Energy - Equipment & Supplies,safeelectricalsolutions.ca,,...,19057859000.0,canada,ontario,scarborough,14162370000.0,Electrical & Wiring Contractors,,,,Electric utility company
1,ON CA,0,on,,canada,auto-master.com,,,,,...,,united states,texas,houston,18559930000.0,,,,,
2,ON CA,0,on,,canada,,Bakeries & Desserts,,,,...,,,,,,Vintage Clothing Store,,Bakeries & Desserts,,


#### Check data

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34706 entries, 0 to 34705
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Source 1 - full address  34705 non-null  object 
 1   Source 1 - phone number  34706 non-null  int64  
 2   Source 1 - Region        30717 non-null  object 
 3   Source 1 - City          30284 non-null  object 
 4   Source 1 - Country       34706 non-null  object 
 5   Source 3 - Website       29169 non-null  object 
 6   Source 3 - Activity      29287 non-null  object 
 7   Source 2 - Activity      16320 non-null  object 
 8   Source 2 - Website       18986 non-null  object 
 9   Source 2 - Country       15564 non-null  object 
 10  Source 2 - Region        15564 non-null  object 
 11  Source 2 - City          15564 non-null  object 
 12  Source 2 - Phone         18433 non-null  object 
 13  Source 3 - Country       22952 non-null  object 
 14  Source 3 - Region     

#### Make sure that everything is labeled correctly 

In [22]:
data.columns 

Index(['Source 1 - full address', 'Source 1 - phone number',
       'Source 1 - Region', 'Source 1 - City', 'Source 1 - Country',
       'Source 3 - Website', 'Source 3 - Activity', 'Source 2 - Activity',
       'Source 2 - Website', 'Source 2 - Country', 'Source 2 - Region',
       'Source 2 - City', 'Source 2 - Phone', 'Source 3 - Country',
       'Source 3 - Region', 'Source 3 - City', 'Source 3 - Phone',
       'Source 4 - Activity', 'Unnamed: 18', 'Source 5 - Activity',
       'Unnamed: 20', 'Source 6 - Activity'],
      dtype='object')

#### Remove unnecessary data 

Two columns, 'Unnamed: 18' and 'Unnamed: 20', have no header. Checking the CSV file shows that they are empty, so we can safely remove them.

In [26]:
data.drop(['Unnamed: 18', 'Unnamed: 20'], axis=1, inplace=True)
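Before dropping, it can be worth confirming in code that the unnamed columns really hold nothing but nulls. The following is a minimal sketch on a toy frame; `toy`, `candidates`, `empty_columns`, and `cleaned` are illustrative names and the values are made up, not taken from the CSV.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "Source 1 - City": ["north york", None, "windsor"],
    "Unnamed: 18": [None, None, None],
    "Unnamed: 20": [None, None, None],
})

candidates = ["Unnamed: 18", "Unnamed: 20"]

# A column is safe to drop only if every value in it is null.
all_empty = toy[candidates].isna().all()
empty_columns = all_empty[all_empty].index.tolist()

# Drop only the columns that passed the check, without mutating `toy`.
cleaned = toy.drop(columns=empty_columns)
```

The same check on the real frame would replace `toy` with `data`. Returning a new frame instead of using `inplace=True` also makes the cell safe to re-run.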

'Source 6' also seems to contain only the business activity and no other information, so I will remove it as well.

In [30]:
data.drop(['Source 6 - Activity'], axis=1, inplace=True)
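One way to see which fields each source provides is to group the column names by their `Source N` prefix. This is a rough sketch that uses the column list printed by `data.columns` earlier in the notebook; `fields_by_source` and `single_field_sources` are illustrative names.

```python
from collections import defaultdict

# Column names copied from the `data.columns` output above.
columns = [
    "Source 1 - full address", "Source 1 - phone number",
    "Source 1 - Region", "Source 1 - City", "Source 1 - Country",
    "Source 3 - Website", "Source 3 - Activity", "Source 2 - Activity",
    "Source 2 - Website", "Source 2 - Country", "Source 2 - Region",
    "Source 2 - City", "Source 2 - Phone", "Source 3 - Country",
    "Source 3 - Region", "Source 3 - City", "Source 3 - Phone",
    "Source 4 - Activity", "Unnamed: 18", "Source 5 - Activity",
    "Unnamed: 20", "Source 6 - Activity",
]

fields_by_source = defaultdict(list)
for name in columns:
    # Skip headers such as 'Unnamed: 18' that do not follow 'Source N - Field'.
    if " - " not in name:
        continue
    source, field = name.split(" - ", 1)
    fields_by_source[source].append(field)

# Sources that contribute exactly one column.
single_field_sources = sorted(
    source for source, fields in fields_by_source.items() if len(fields) == 1
)
```

In this listing, 'Source 4' and 'Source 5' also carry only an activity column, so whether to keep them is a separate judgment call that the grouping alone does not settle.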

In [32]:
data.head()

Unnamed: 0,Source 1 - full address,Source 1 - phone number,Source 1 - Region,Source 1 - City,Source 1 - Country,Source 3 - Website,Source 3 - Activity,Source 2 - Activity,Source 2 - Website,Source 2 - Country,Source 2 - Region,Source 2 - City,Source 2 - Phone,Source 3 - Country,Source 3 - Region,Source 3 - City,Source 3 - Phone,Source 4 - Activity,Source 5 - Activity
0,32 BLUE SPRINGS RD NORTH YORK ON M6L2T3 CA,4168582406,on,north york,canada,safeelectricalsolutions.ca,Electrical & Wiring Contractors,Energy - Equipment & Supplies,safeelectricalsolutions.ca,,,,19057859000,canada,ontario,scarborough,14162370000.0,Electrical & Wiring Contractors,
1,ON CA,0,on,,canada,auto-master.com,,,,,,,,united states,texas,houston,18559930000.0,,
2,ON CA,0,on,,canada,,Bakeries & Desserts,,,,,,,,,,,Vintage Clothing Store,Bakeries & Desserts
3,2000 TALBOT RD WINDSOR ON N9A6S4 CA,0,on,windsor,canada,stclaircollege.ca,,,stclaircollege.ca,canada,ontario,windsor,"+15199722727,+15199722739,+15199661656",canada,ontario,windsor,15199720000.0,,
4,5117 52 ST WABAMUN AB T0E2K0 CA,9059653927,ab,wabamun,canada,,Oil & Gas - Extraction & Distribution,,,,,,,,,,,,


#### Check for unique values 

In [24]:
data.nunique()

Source 1 - full address    27534
Source 1 - phone number    11969
Source 1 - Region             32
Source 1 - City             2682
Source 1 - Country             1
Source 3 - Website         23848
Source 3 - Activity          550
Source 2 - Activity          388
Source 2 - Website         14872
Source 2 - Country             1
Source 2 - Region             15
Source 2 - City             1083
Source 2 - Phone           15046
Source 3 - Country            81
Source 3 - Region            245
Source 3 - City             2082
Source 3 - Phone           18923
Source 4 - Activity          341
Unnamed: 18                    0
Source 5 - Activity          539
Unnamed: 20                    0
Source 6 - Activity         1650
dtype: int64

#### Match names within the columns

In the 'Source 3 - Country' column, the same country appears to be spelled in several different ways, so the variants need to be mapped to a single name.
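A minimal sketch of such a mapping is shown here, assuming the variants differ mainly in case, punctuation, whitespace, and common abbreviations. The notebook does not list the actual variants, so `COUNTRY_ALIASES` and `normalize_country` are hypothetical and the alias entries are only examples.

```python
import re

# Hypothetical alias table: the real variants in 'Source 3 - Country' are not
# listed in the notebook, so these entries are illustrative only.
COUNTRY_ALIASES = {
    "ca": "canada",
    "can": "canada",
    "us": "united states",
    "usa": "united states",
    "united states of america": "united states",
}


def normalize_country(value):
    """Return a canonical lower-case country name, or None for missing values."""
    # None, NaN (a float), and other non-string values are treated as missing.
    if not isinstance(value, str):
        return None
    # Drop periods, trim, lower-case, and collapse runs of whitespace.
    cleaned = re.sub(r"\s+", " ", value.replace(".", "").strip().lower())
    if not cleaned:
        return None
    # Fall back to the cleaned value when no alias is known.
    return COUNTRY_ALIASES.get(cleaned, cleaned)
```

On the DataFrame this could be applied with `data['Source 3 - Country'].map(normalize_country)`. Inspecting `data['Source 3 - Country'].value_counts()` first would show which aliases are actually needed.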

### Check for null values

In [25]:
data.isnull().sum()

Source 1 - full address        1
Source 1 - phone number        0
Source 1 - Region           3989
Source 1 - City             4422
Source 1 - Country             0
Source 3 - Website          5537
Source 3 - Activity         5419
Source 2 - Activity        18386
Source 2 - Website         15720
Source 2 - Country         19142
Source 2 - Region          19142
Source 2 - City            19142
Source 2 - Phone           16273
Source 3 - Country         11754
Source 3 - Region          13660
Source 3 - City            13658
Source 3 - Phone           10933
Source 4 - Activity        17611
Unnamed: 18                34706
Source 5 - Activity        19455
Unnamed: 20                34706
Source 6 - Activity        17382
dtype: int64

### Handle null values

## Checking the best possible values 

### What makes a value the best possible