# Data Prep
---
### Summary
1. Import Modules and Data
2. Data Exploration
3. Data Cleaning
    - A) Drop Useless Columns
    - B) Drop Null Values
    - C) Rename Columns
4. Export Clean Data

## 1. Import Modules and Data

In [53]:
import pandas as pd
import numpy as np
import datetime
from datetime import date
import matplotlib.pyplot as plt
%matplotlib inline

# Silence the SettingWithCopyWarning that pandas raises on chained assignment
pd.options.mode.chained_assignment = None

# See all data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated

In [54]:
df = pd.read_csv("crm_data.csv")

In [55]:
df.head(1)

,Invoice Date,Invoice #,Location City,Location State,Location Zip,Job Type,Zone,Total,Completion Date,Job Class
0,2021-01-15,18969046,Colorado Springs,CO,80917.0,Sprinkler Repair,Powers,49.0,2021-01-15,Service


## 2. Data Exploration

In [56]:
# How many rows and columns are there, and what are the dtypes and non-null counts?
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15754 entries, 0 to 15753
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Invoice Date     15753 non-null  object 
 1   Invoice #        15754 non-null  object 
 2   Location City    15709 non-null  object 
 3   Location State   15694 non-null  object 
 4   Location Zip     15673 non-null  float64
 5   Job Type         15754 non-null  object 
 6   Zone             15664 non-null  object 
 7   Total            15754 non-null  float64
 8   Completion Date  15753 non-null  object 
 9   Job Class        3354 non-null   object 
dtypes: float64(2), object(8)
memory usage: 1.2+ MB
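
The `df.info()` output above shows that `Job Class` is populated in only 3354 of 15754 rows. A quick way to compare columns on this basis is the share of missing values per column. The snippet below is a minimal sketch on a made-up frame (`toy` and its values are hypothetical stand-ins, not rows from `crm_data.csv`).

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Invoice #": ["a1", "a2", "a3", "a4"],
    "Location Zip": [80917.0, np.nan, 80921.0, 80831.0],
    "Job Class": ["Service", np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; the column mean is the fraction of nulls.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Run against the real `df`, the same expression would rank the columns by how sparse they are, which can help decide what to drop in the next step.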


## 3. Data Cleaning

### A) Drop Useless Columns

In [57]:
df = df.drop(columns=[
    "Job Class", # Not useful for analysis
    "Zone",
    "Job Type",
    "Location State",
    "Location City"]) # Could use Address or Location

df.head()

,Invoice Date,Invoice #,Location Zip,Total,Completion Date
0,2021-01-15,18969046,80917.0,49.0,2021-01-15
1,2020-12-15,18969302,80906.0,84.0,2020-12-15
2,2021-01-15,18969558,80921.0,84.0,2021-01-15
3,2021-01-15,18969814,80831.0,0.0,2021-01-15
4,2021-01-18,18970070,80831.0,84.0,2021-01-18


### B) Drop Null Values

In [58]:
# How many null values and which columns?
df.isnull().sum()

Invoice Date       1 
Invoice #          0 
Location Zip       81
Total              0 
Completion Date    1 
dtype: int64

In [59]:
# Drop every row that contains at least one null value
df = df.dropna()

In [60]:
df.isnull().sum()

Invoice Date       0
Invoice #          0
Location Zip       0
Total              0
Completion Date    0
dtype: int64
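
Comparing the two `df.info()` outputs in this notebook, `dropna()` takes the frame from 15754 to 15673 rows, so 81 rows are removed. The sketch below shows one way to make that count explicit and to limit the drop to named columns with `subset`. The frame `toy` and its values are hypothetical, and restricting the drop to these two columns is an assumption for illustration, not what the cell above does.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Invoice Date": ["2021-01-15", None, "2021-01-15", "2021-01-18"],
    "Location Zip": [80917.0, 80906.0, np.nan, 80831.0],
    "Total": [49.0, 84.0, 84.0, 0.0],
})

rows_before = len(toy)

# Only rows with a null in one of the listed columns are dropped.
cleaned = toy.dropna(subset=["Invoice Date", "Location Zip"])

rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

Note that `dropna()` keeps the original index labels, which is why the cleaned frame later in this notebook still runs from 0 to 15752.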

### C) Rename Columns

In [61]:
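# Clean up column names: trim, lowercase, snake_case, and spell out '#'
df.columns = (
    df.columns.str.strip().str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('#', 'number', regex=False)
)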


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15673 entries, 0 to 15752
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   invoice_date     15673 non-null  object 
 1   invoice_number   15673 non-null  object 
 2   location_zip     15673 non-null  float64
 3   total            15673 non-null  float64
 4   completion_date  15673 non-null  object 
dtypes: float64(2), object(3)
memory usage: 734.7+ KB
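
The renaming step can be checked in isolation by applying the same chain of string operations to the five original headers. This is a minimal sketch: the names `original` and `cleaned` are hypothetical, and `regex=False` is passed so that `(`, `)` and `#` are treated as literal characters.

```python
import pandas as pd

# The five headers that remain after the column drop in section A.
original = pd.Index([
    "Invoice Date", "Invoice #", "Location Zip", "Total", "Completion Date",
])

# Same steps as the cleaning cell: trim, lowercase, replace spaces,
# strip parentheses, and spell out '#'.
cleaned = (
    original.str.strip().str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("#", "number", regex=False)
)

print(list(cleaned))
# ['invoice_date', 'invoice_number', 'location_zip', 'total', 'completion_date']
```

The result matches the column names listed in the `df.info()` output above.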


## 4. Export Clean Data 

In [63]:
df.to_csv("clean_crm_data.csv", encoding='utf-8', index=False)

In [64]:
check = pd.read_csv("clean_crm_data.csv")
check.head()

,invoice_date,invoice_number,location_zip,total,completion_date
0,2021-01-15,18969046,80917.0,49.0,2021-01-15
1,2020-12-15,18969302,80906.0,84.0,2020-12-15
2,2021-01-15,18969558,80921.0,84.0,2021-01-15
3,2021-01-15,18969814,80831.0,0.0,2021-01-15
4,2021-01-18,18970070,80831.0,84.0,2021-01-18
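
Eyeballing `check.head()` confirms the file loads, but the round trip can also be verified programmatically. The sketch below writes a small made-up frame to an in-memory buffer instead of `clean_crm_data.csv` and reads it back. The frame `toy`, the buffer, and the choice to read `invoice_number` as a string are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the cleaned df; the values are illustrative.
toy = pd.DataFrame({
    "invoice_number": ["18969046", "18969302"],
    "location_zip": [80917.0, 80906.0],
    "total": [49.0, 84.0],
})

# Write without the index, as in the export cell, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"invoice_number": str})

# The column order and the row/column counts should survive the round trip.
same_columns = list(restored.columns) == list(toy.columns)
same_shape = restored.shape == toy.shape
print(same_columns, same_shape)
```

Because the export uses `index=False`, no extra index column is written, so the restored frame has exactly the columns that were saved.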
