# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset:

Import the necessary libraries and create your dataframe(s).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the cost of living dataset into a dataframe
COL_in_countries_df = pd.read_csv('Cost_of_Living_Index_by_Country_2024.csv')
COL_in_countries_df

| | Rank | Country | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Switzerland | 101.1 | 46.5 | 74.9 | 109.1 | 97.0 | 158.7 |
| 1 | 2 | Bahamas | 85.0 | 36.7 | 61.8 | 81.6 | 83.3 | 54.6 |
| 2 | 3 | Iceland | 83.0 | 39.2 | 62.0 | 88.4 | 86.8 | 120.3 |
| 3 | 4 | Singapore | 76.7 | 67.2 | 72.1 | 74.6 | 50.4 | 111.1 |
| 4 | 5 | Barbados | 76.6 | 19.0 | 48.9 | 80.8 | 69.4 | 43.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 116 | 117 | Bangladesh | 22.5 | 2.4 | 12.8 | 25.7 | 12.8 | 33.1 |
| 117 | 118 | India | 21.2 | 5.6 | 13.7 | 23.8 | 15.1 | 82.6 |
| 118 | 119 | Egypt | 21.0 | 3.7 | 12.7 | 21.2 | 16.2 | 20.0 |
| 119 | 120 | Libya | 20.4 | 4.3 | 12.7 | 22.2 | 15.2 | 42.0 |


In [2]:
COL_in_countries_df.shape

(121, 8)

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [3]:
no_val_df = COL_in_countries_df.isnull().sum()
no_val_df
# No missing values: every column shows a count of 0. Any missing entries would have
# shown up as a nonzero count next to the corresponding column name.

Rank                              0
Country                           0
Cost of Living Index              0
Rent Index                        0
Cost of Living Plus Rent Index    0
Groceries Index                   0
Restaurant Price Index            0
Local Purchasing Power Index      0
dtype: int64
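
`isnull()` only counts values that pandas already recognizes as missing. As a hedged sketch of a slightly wider check, the snippet below also looks for blank or placeholder strings in a text column. The `sample_df` is a small sample built from rows shown in the output above (not the full CSV), and the `placeholders` list is an assumption for illustration, not something observed in this dataset.

```python
import pandas as pd

# Small sample copied from rows displayed earlier (not the full dataset).
sample_df = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Country": ["Switzerland", "Bahamas", "Iceland"],
    "Rent Index": [46.5, 36.7, 39.2],
})

# Hypothetical placeholder strings that sometimes stand in for missing values.
placeholders = ["", "N/A", "NA", "-", "?"]

# Count values pandas already treats as missing, per column.
nan_counts = sample_df.isnull().sum()

# Count placeholder strings in the text column after trimming whitespace.
placeholder_count = int(sample_df["Country"].str.strip().isin(placeholders).sum())

print(nan_counts)
print("Placeholder strings in Country:", placeholder_count)
```

If either count were nonzero, the affected rows would need a closer look before deciding whether to fill or drop them.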

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [4]:
COL_in_countries_df.describe()
# Nothing in these summary statistics looks irregular enough to affect my results at this time.

| | Rank | Cost of Living Index | Rent Index | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index | Local Purchasing Power Index |
|---|---|---|---|---|---|---|---|
| count | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 |
| mean | 61.0 | 43.555372 | 16.052893 | 30.357851 | 44.228926 | 36.471074 | 65.094215 |
| std | 35.073732 | 16.147574 | 11.412267 | 13.263721 | 17.055109 | 18.25811 | 39.569094 |
| min | 1.0 | 18.8 | 2.4 | 11.1 | 17.5 | 12.8 | 2.3 |
| 25% | 31.0 | 30.2 | 8.5 | 19.8 | 31.6 | 21.6 | 34.8 |
| 50% | 61.0 | 39.5 | 12.4 | 27.0 | 40.5 | 33.1 | 50.6 |
| 75% | 91.0 | 52.8 | 20.1 | 37.0 | 53.7 | 47.2 | 99.4 |
| max | 121.0 | 101.1 | 67.2 | 74.9 | 109.1 | 97.0 | 182.5 |
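
One common way to turn the quartiles from `describe()` into an explicit outlier check is the 1.5 × IQR rule. The sketch below is only an illustration of that rule, not the method used in this notebook: the quartiles are the `Rent Index` 25% and 75% values from the summary, and the `rent` Series holds only the handful of `Rent Index` values visible in the first table, so it is a sample rather than the full column.

```python
import pandas as pd

# Rent Index quartiles copied from the describe() summary.
q1, q3 = 8.5, 20.1
iqr = q3 - q1

# Standard 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Sample of Rent Index values from the rows displayed earlier (not the full column).
rent = pd.Series(
    [46.5, 36.7, 39.2, 67.2, 19.0, 2.4, 5.6, 3.7, 4.3],
    index=["Switzerland", "Bahamas", "Iceland", "Singapore", "Barbados",
           "Bangladesh", "India", "Egypt", "Libya"],
    name="Rent Index",
)

# Values outside the fences are candidates for review.
flagged = rent[(rent < lower) | (rent > upper)]
print(f"Fences: {lower:.1f} to {upper:.1f}")
print(flagged)
```

A value flagged this way is a candidate for a closer look, not necessarily an error; a genuinely high index can still be valid data.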


## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [6]:
# I am interested in the Restaurant Price Index, the Cost of Living Plus Rent Index, and the Groceries Index.
# I am treating Local Purchasing Power Index as unnecessary: I assume that being able to purchase restaurant
# goods and services falls into a similar category. It is not exactly the same, but close enough for the
# purpose of deeming data unnecessary in this exercise.
# The separate Cost of Living Index and Rent Index are dropped as well; the combined Cost of Living Plus Rent Index is kept.

new_cost_df = COL_in_countries_df.drop(['Local Purchasing Power Index', 'Cost of Living Index','Rent Index'], axis=1)
new_cost_df

#If the Rent Index is 80, it suggests that the average rental prices in that city are approximately 20% lower than those in New York City

| | Rank | Country | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index |
|---|---|---|---|---|---|
| 0 | 1 | Switzerland | 74.9 | 109.1 | 97.0 |
| 1 | 2 | Bahamas | 61.8 | 81.6 | 83.3 |
| 2 | 3 | Iceland | 62.0 | 88.4 | 86.8 |
| 3 | 4 | Singapore | 72.1 | 74.6 | 50.4 |
| 4 | 5 | Barbados | 48.9 | 80.8 | 69.4 |
| ... | ... | ... | ... | ... | ... |
| 116 | 117 | Bangladesh | 12.8 | 25.7 | 12.8 |
| 117 | 118 | India | 13.7 | 23.8 | 15.1 |
| 118 | 119 | Egypt | 12.7 | 21.2 | 16.2 |
| 119 | 120 | Libya | 12.7 | 22.2 | 15.2 |
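
Unneeded columns are only one kind of unnecessary data; duplicate rows and columns that hold a single repeated value are others. The following is a minimal sketch of how those could be checked, assuming a small `sample_df` built from rows shown above rather than the full dataframe.

```python
import pandas as pd

# Small sample copied from rows displayed earlier (not the full dataset).
sample_df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Country": ["Switzerland", "Bahamas", "Iceland", "Singapore", "Barbados"],
    "Groceries Index": [109.1, 81.6, 88.4, 74.6, 80.8],
})

# Rows that repeat an earlier row exactly.
duplicate_rows = int(sample_df.duplicated().sum())

# Countries that appear more than once, even if the other values differ.
duplicate_countries = int(sample_df["Country"].duplicated().sum())

# Columns with only one distinct value carry no information.
constant_cols = [col for col in sample_df.columns if sample_df[col].nunique() == 1]

print("Duplicate rows:", duplicate_rows)
print("Duplicate countries:", duplicate_countries)
print("Constant columns:", constant_cols)
```

Running the same three checks on `new_cost_df` would show whether anything beyond the dropped columns needs to be removed.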


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [7]:
# Inspect the first and last 10 rows.
head_rows = new_cost_df.head(10)
tail_rows = new_cost_df.tail(10)
# I didn't find any inconsistent data in the first or last 10 rows. I am also going to check the
# middle rows to be sure, although the dataset looks thorough.

total_rows = len(new_cost_df)

# Start 5 rows before the midpoint and stop 5 rows after it to get the 10 middle rows.
middle_index = total_rows // 2
start_index = middle_index - 5
end_index = middle_index + 5

middle_rows = new_cost_df.iloc[start_index:end_index]

# Stack the three slices with labels so they can be reviewed together.
combined_rows = pd.concat([head_rows, middle_rows, tail_rows], keys=['Head', 'Middle', 'Tail'])
combined_rows

# There is no inconsistent data in this set.

| | | Rank | Country | Cost of Living Plus Rent Index | Groceries Index | Restaurant Price Index |
|---|---|---|---|---|---|---|
| Head | 0 | 1 | Switzerland | 74.9 | 109.1 | 97.0 |
| Head | 1 | 2 | Bahamas | 61.8 | 81.6 | 83.3 |
| Head | 2 | 3 | Iceland | 62.0 | 88.4 | 86.8 |
| Head | 3 | 4 | Singapore | 72.1 | 74.6 | 50.4 |
| Head | 4 | 5 | Barbados | 48.9 | 80.8 | 69.4 |
| Head | 5 | 6 | Norway | 52.1 | 79.0 | 73.5 |
| Head | 6 | 7 | Denmark | 50.2 | 64.8 | 81.3 |
| Head | 7 | 8 | Hong Kong (China) | 65.3 | 84.6 | 46.2 |
| Head | 8 | 9 | United States | 56.6 | 75.0 | 67.2 |
| Head | 9 | 10 | Australia | 52.5 | 77.3 | 62.5 |
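
Reading rows by eye can miss inconsistencies such as stray whitespace or a gap in the ranking. As a hedged complement to the visual check, the sketch below runs a few programmatic tests on a small `sample_df` built from rows shown above; the specific checks are assumptions about what "inconsistent" could mean for this kind of table, not an exhaustive list.

```python
import pandas as pd

# Small sample copied from rows displayed above (not the full dataset).
sample_df = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Country": ["Switzerland", "Bahamas", "Iceland", "Singapore", "Barbados"],
    "Restaurant Price Index": [97.0, 83.3, 86.8, 50.4, 69.4],
})

# Country names with leading or trailing whitespace.
untrimmed = sample_df[sample_df["Country"] != sample_df["Country"].str.strip()]

# Rank should run 1..n with no gaps or repeats.
rank_ok = sample_df["Rank"].tolist() == list(range(1, len(sample_df) + 1))

# Index columns should be numeric, not strings that merely look like numbers.
numeric_ok = pd.api.types.is_numeric_dtype(sample_df["Restaurant Price Index"])

print("Untrimmed country names:", len(untrimmed))
print("Rank sequence consistent:", rank_ok)
print("Restaurant Price Index numeric:", numeric_ok)
```

Checks like these scale to all 121 rows, whereas the head, middle, and tail slices cover only 30 of them.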


## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset? I found 2 out of the 4 types of dirty data.
2. Did the process of cleaning your data give you new insights into your dataset? Yes, it allowed me to sort through the data enough to decide what information I need.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations? I want to try more than one type of visualization to find the one that best suits the data.