## Data Preparation Practice

#### Purpose: 

+ The purpose of this notebook is to practice the data gathering, assessment, and cleaning process using pandas.

#### The Dataset:

+ The data comes from the Chicago Red Light and Speed Camera dataset, available on Kaggle [here](https://www.kaggle.com/chicago/chicago-red-light-and-speed-camera-data). This notebook works with the speed camera violations file (`speed-camera-violations.csv`).

#### Import Libraries

In [1]:
# Import libraries necessary to unzip data and clean data.

import zipfile
import pandas as pd

### Gather Data

In [2]:
# Use zipfile to extract the data archive

with zipfile.ZipFile('chicago-red-light-and-speed-camera-data.zip','r') as myzip:
    myzip.extractall()
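
As an alternative to extracting everything to disk, a single CSV can be read straight out of the archive. The sketch below assumes a hypothetical helper name (`read_csv_from_zip`) and builds a tiny in-memory archive with made-up member names so it runs without the Kaggle download; it is an illustration, not part of the original notebook.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_source, member_name):
    """Load one CSV member of a zip archive into a DataFrame without extracting it."""
    with zipfile.ZipFile(zip_source, "r") as archive:
        with archive.open(member_name) as handle:
            return pd.read_csv(handle)


# Build a small in-memory archive (toy data, hypothetical file name).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("toy.csv", "CAMERA ID,VIOLATIONS\nCHI120,196\nCHI010,60\n")
buffer.seek(0)

toy = read_csv_from_zip(buffer, "toy.csv")
print(toy)
```

With the real archive, the first argument would be the zip file path and the second the CSV name used in the next cell.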

In [10]:
df = pd.read_csv('speed-camera-violations.csv')
df.head()

| | ADDRESS | CAMERA ID | VIOLATION DATE | VIOLATIONS | X COORDINATE | Y COORDINATE | LATITUDE | LONGITUDE | LOCATION | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10318 S INDIANAPOLIS | CHI120 | 2019-07-04T00:00:00.000 | 196 | 1203645.0 | 1837056.0 | 41.707577 | -87.529848 | {'human_address': '{"address": "", "city": "",... | 47.0 | 21202.0 | 49.0 | 705.0 | 47.0 |
| 1 | 1111 N HUMBOLDT | CHI010 | 2019-07-04T00:00:00.000 | 60 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1142 W IRVING PARK | CHI095 | 2019-07-04T00:00:00.000 | 80 | 1167790.0 | 1926747.0 | 41.954541 | -87.658573 | {'human_address': '{"address": "", "city": "",... | 37.0 | 21186.0 | 31.0 | 241.0 | 39.0 |
| 3 | 115 N OGDEN | CHI077 | 2019-07-04T00:00:00.000 | 47 | 1166485.0 | 1900735.0 | 41.883192 | -87.664115 | {'human_address': '{"address": "", "city": "",... | 41.0 | 14917.0 | 29.0 | 63.0 | 46.0 |
| 4 | 1315 W GARFIELD BLVD | CHI121 | 2019-07-04T00:00:00.000 | 58 | 1168445.0 | 1868118.0 | 41.793645 | -87.657861 | {'human_address': '{"address": "", "city": "",... | 19.0 | 22257.0 | 65.0 | 297.0 | 2.0 |


### Assess

In [13]:
# Check format of column names

df.columns

Index(['ADDRESS', 'CAMERA ID', 'VIOLATION DATE', 'VIOLATIONS', 'X COORDINATE',
       'Y COORDINATE', 'LATITUDE', 'LONGITUDE', 'LOCATION',
       'Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',
       'Census Tracts', 'Wards'],
      dtype='object')

In [11]:
# Check for null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193236 entries, 0 to 193235
Data columns (total 14 columns):
ADDRESS                       193236 non-null object
CAMERA ID                     193236 non-null object
VIOLATION DATE                193236 non-null object
VIOLATIONS                    193236 non-null int64
X COORDINATE                  185910 non-null float64
Y COORDINATE                  185910 non-null float64
LATITUDE                      185910 non-null float64
LONGITUDE                     185910 non-null float64
LOCATION                      185910 non-null object
Historical Wards 2003-2015    185910 non-null float64
Zip Codes                     185910 non-null float64
Community Areas               185910 non-null float64
Census Tracts                 185910 non-null float64
Wards                         185910 non-null float64
dtypes: float64(9), int64(1), object(4)
memory usage: 20.6+ MB


In [12]:
# Check for duplicate rows

df.duplicated().sum()

0

#### Observations: 

+ Column names mix upper-case and mixed-case styles.
+ Column names contain spaces instead of underscores.
+ 7,326 rows (193,236 - 185,910) have null values in 10 of the 14 columns; only `ADDRESS`, `CAMERA ID`, `VIOLATION DATE`, and `VIOLATIONS` are complete.
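
The null-value observation can be quantified directly rather than read off `df.info()`. The following is a minimal sketch on a made-up toy frame (the values and the reduced column set are assumptions for illustration); the same three expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame: identifier columns are complete, location columns are missing together.
toy = pd.DataFrame(
    {
        "ADDRESS": ["10318 S INDIANAPOLIS", "1111 N HUMBOLDT", "115 N OGDEN"],
        "VIOLATIONS": [196, 60, 47],
        "LATITUDE": [41.707577, np.nan, 41.883192],
        "LONGITUDE": [-87.529848, np.nan, -87.664115],
    }
)

# Number of rows that contain at least one null value.
rows_with_nulls = int(toy.isnull().any(axis=1).sum())

# Columns that have no null values at all.
complete_columns = toy.columns[toy.notnull().all()].tolist()

# How many null values each row has, tallied by count.
nulls_per_row = toy.isnull().sum(axis=1).value_counts().to_dict()

print(rows_with_nulls)
print(complete_columns)
print(nulls_per_row)
```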

### Cleaning


#### Define

+ Change all column names to lower case.
+ Replace spaces in column names with underscores.
+ Remove the rows that contain null values.

#### Code

**Change all column names to lower case letters**

In [15]:
df.rename(columns=lambda x: x.lower(), inplace=True)
df.columns

Index(['address', 'camera id', 'violation date', 'violations', 'x coordinate',
       'y coordinate', 'latitude', 'longitude', 'location',
       'historical wards 2003-2015', 'zip codes', 'community areas',
       'census tracts', 'wards'],
      dtype='object')

**Substitute spaces in column headers with underscores**

In [17]:
df.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)
df.columns

Index(['address', 'camera_id', 'violation_date', 'violations', 'x_coordinate',
       'y_coordinate', 'latitude', 'longitude', 'location',
       'historical_wards_2003-2015', 'zip_codes', 'community_areas',
       'census_tracts', 'wards'],
      dtype='object')
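
The two renaming steps can also be combined into one assignment with the vectorized string methods on the column index. This is a sketch of an alternative, shown on an empty toy frame with a few of the original column names, not the approach the notebook itself uses.

```python
import pandas as pd

# Empty toy frame carrying a sample of the original column names.
toy = pd.DataFrame(
    columns=["CAMERA ID", "VIOLATION DATE", "Zip Codes", "Historical Wards 2003-2015"]
)

# Index.str exposes vectorized string methods, so both steps chain together.
# regex=False treats the space as a literal character rather than a pattern.
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(toy.columns))
```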

**Remove rows with null values.**

In [18]:
df.dropna(axis=0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185910 entries, 0 to 193235
Data columns (total 14 columns):
address                       185910 non-null object
camera_id                     185910 non-null object
violation_date                185910 non-null object
violations                    185910 non-null int64
x_coordinate                  185910 non-null float64
y_coordinate                  185910 non-null float64
latitude                      185910 non-null float64
longitude                     185910 non-null float64
location                      185910 non-null object
historical_wards_2003-2015    185910 non-null float64
zip_codes                     185910 non-null float64
community_areas               185910 non-null float64
census_tracts                 185910 non-null float64
wards                         185910 non-null float64
dtypes: float64(9), int64(1), object(4)
memory usage: 21.3+ MB
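
With default settings, `dropna` removes a row when any value is missing. In this dataset the incomplete rows are missing the same ten columns, so that rule and a "mostly empty rows" rule give the same result. On other data the two can differ, as the sketch below suggests using the `thresh` parameter on a made-up toy frame (the values and the threshold formula are illustrative assumptions).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "address": ["A", "B", "C"],
        "violations": [10, 20, 30],
        "latitude": [41.7, np.nan, 41.9],
        "longitude": [-87.5, np.nan, np.nan],
        "wards": [47.0, np.nan, 46.0],
    }
)

# Strict rule: drop a row if ANY value is missing.
strict = toy.dropna(axis=0)

# Looser rule: keep rows where at least half of the columns are populated,
# i.e. drop only rows where more than half of the columns are null.
min_non_null = (toy.shape[1] + 1) // 2
majority = toy.dropna(axis=0, thresh=min_non_null)

print(len(strict), len(majority))
```

Row `C` is missing only one value, so the strict rule drops it while the threshold rule keeps it.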


#### Test

In [27]:
# Check that all column headers are lower case

for e in df.columns:
    assert e.islower()
print('All column headers are lower case')

All column headers are lower case


In [28]:
# Check that there are no spaces in column names

for e in df.columns:
    assert ' ' not in e
print('There are no spaces in column names')

There are no spaces in column names


In [36]:
# Check that there are no null values

assert not df.isnull().any().any()
print('There are no null values')




There are no null values
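
The three checks above could be gathered into one reusable function that reports every problem instead of stopping at the first failed `assert`. This is a minimal sketch; `check_clean` is a hypothetical helper name and the two toy frames are made up for illustration.

```python
import pandas as pd


def check_clean(frame):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    not_lower = [c for c in frame.columns if c != c.lower()]
    if not_lower:
        problems.append(f"columns not lower case: {not_lower}")

    with_spaces = [c for c in frame.columns if " " in c]
    if with_spaces:
        problems.append(f"columns containing spaces: {with_spaces}")

    null_columns = frame.columns[frame.isnull().any()].tolist()
    if null_columns:
        problems.append(f"columns containing nulls: {null_columns}")

    return problems


clean = pd.DataFrame({"camera_id": ["CHI120", "CHI010"], "violations": [196, 60]})
dirty = pd.DataFrame({"Camera ID": ["CHI120", None]})

print(check_clean(clean))
print(check_clean(dirty))
```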


### Conclusions

The following changes were made to the dataset:

+ Changed all column names to lower case.
+ Replaced spaces in column names with underscores.
+ Removed the 7,326 rows containing null values, leaving 185,910 complete rows.

