# Data Exploration and Wrangling Assignment

The objective of this assignment is to evaluate your understanding of Data Exploration and Data Wrangling concepts, along with your ability to apply them practically using standard Python libraries.

In [47]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Reading dataset

You are given a CSV file. The file is located in the `data` directory under the name `data.csv`.

You are required to read it using the pandas library as a pandas dataframe.

In [56]:
# Listing files in the assignment for reference
import os

for dirname, dirnames, filenames in os.walk("."):
    # Remove hidden directories from the list of directories to walk
    dirnames[:] = [d for d in dirnames if not d.startswith('.')]
    
    for filename in filenames:
        # Check if the file has the desired extensions
        if filename.endswith(".csv") or filename.endswith(".ipynb"):
            print(os.path.join(dirname, filename))

./Assignment1.ipynb
./data/data.csv


In [57]:
filename = None
### BEGIN SOLUTION
filename = "./data/data.csv"
### END SOLUTION

df = None

### BEGIN SOLUTION
df = pd.read_csv(filename)
### END SOLUTION

df.head()

Unnamed: 0,ID,Name,Date of Birth,Age,Gender,Body Mass Index,Address,Weekly wage,Job status,Profession,Annual Income,Email address,Contact Number,Marital Status,# of dependants,Avg Monthly Expenditure
0,5S6B6BP,KEVIN ANDERSON,"January 27, 2018",6.0,F,28.84625,,8937,Employed,Teacher,227499,qhall@hotmail.com,3346312423325,Unmarried,4,484974
1,TM29EO2,NICHOLAS PERRY,"August 01, 1959",65.0,O,,"030 Carrillo Junction, Michelleton, Oregon, 96462",965,Unemployed,,279027,marissadavidson@pitts.com,7984249407157,,3,431570
2,MG1IJX8,steven cortez,"May 18, 1995",29.0,O,15.653214,"3497 James Islands Apt. 257, South Chase, Tenn...",8244,,,68767,gscott@yahoo.com,6789661925093,,3,5K
3,W994VAP,ROBIN COLLINS,12/29/1910,114.0,O,33.762516,"2918 Bradley Cove Apt. 471, Patelborough, Nort...",7001,Student,,10762,yvonne12@dean-holder.net,1482600466366,Married,6,
4,7ZLRS0W,alyssa clark dds,"July 19, 1943",81.0,,22.71221,"2839 Avila Greens Suite 307, Johnsonton, Misso...",7633,Employed,Engineer,384292,shawnjones@guzman.com,8028511970892,,3,314968


In [64]:
assert len(df) == 1000
assert set(df.columns) == {'ID', 'Name', 'Date of Birth', 'Age', 'Gender', 'Body Mass Index',
       'Address', 'Weekly wage', 'Job status', 'Profession', 'Annual Income',
       'Email address', 'Contact Number', 'Marital Status', '# of dependants',
       'Avg Monthly Expenditure'}
### BEGIN HIDDEN TESTS
# checking for particular samples at fixed indexes
assert df.iloc[0].Name == "KEVIN ANDERSON" 
assert df.iloc[448].Address == "38207 Hansen Locks, Torresfurt, Maine, 51038"

# assert types of columns
assert df['ID'].dtype == object
assert df['Name'].dtype == object
assert df['Date of Birth'].dtype == object
assert df['Age'].dtype == np.float64
assert df['Gender'].dtype == object
assert df['Body Mass Index'].dtype == np.float64
assert df['Address'].dtype == object
assert df['Weekly wage'].dtype == object
assert df['Job status'].dtype == object
assert df['Profession'].dtype == object
assert df['Annual Income'].dtype == object
assert df['Email address'].dtype == object
assert df['Contact Number'].dtype == object
assert df['Marital Status'].dtype == object
assert df['# of dependants'].dtype == object
assert df['Avg Monthly Expenditure'].dtype == object

### END HIDDEN TESTS

## Data Types of Columns
Let us checkout the data types of the columns.

In [66]:
df.dtypes

ID                          object
Name                        object
Date of Birth               object
Age                        float64
Gender                      object
Body Mass Index            float64
Address                     object
Weekly wage                 object
Job status                  object
Profession                  object
Annual Income               object
Email address               object
Contact Number              object
Marital Status              object
# of dependants             object
Avg Monthly Expenditure     object
dtype: object

### Observation

We see that most of the columns are read as `object` while Age and Body Mass Index have been read as 'float64'.

This indicates that pandas was unable to automatically assign appropriate data types to our columns.

Thus usually occurs when our data is dirty. Let us explore the different columns to find and fix the problematic records.

# Data Exploration

In this section, we will look at some of the columns and try to find columns with dirty data and clean them using some set rules.

