In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
import seaborn as sns

## Read data

In [2]:
df = pd.read_csv("./output/all-data.csv")

## Data Cleaning

#### Get concise summary of the dataframe 

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186850 entries, 0 to 186849
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order ID          186305 non-null  object
 1   Product           186305 non-null  object
 2   Quantity Ordered  186305 non-null  object
 3   Price Each        186305 non-null  object
 4   Order Date        186305 non-null  object
 5   Purchase Address  186305 non-null  object
dtypes: object(6)
memory usage: 8.6+ MB
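
The summary above lists 186850 entries but only 186305 non-null values per column, so each column has 545 missing values. A minimal sketch of how those gaps could be counted directly, using a small hypothetical frame (`toy`) rather than the real file:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Order ID": ["1001", None, "1003"],
    "Product": ["Wired Headphones", None, "USB-C Charging Cable"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows in which every column is missing
fully_empty_rows = toy.isna().all(axis=1).sum()

print(missing_per_column)
print(fully_empty_rows)
```

Running the same two expressions on `df` would show whether the missing values sit in the same rows across all columns.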


### First: Column labels

In [4]:
df.columns

Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')

##### The column labels have two problems:
1. Several are longer than they need to be.
2. Most contain spaces, which rules out attribute-style access such as `df.Quantity`.

#### Rename

In [5]:
df.rename(columns={"Quantity Ordered":"Quantity",
                   "Price Each":"unit price","Order Date":"Date",
                   "Purchase Address":"Address"},inplace=True)

#### Convert to title case

In [6]:
df.rename(str.title, axis='columns',inplace=True)

#### Remove spaces 

In [7]:
df.columns=df.columns.str.replace(" ","")

##### Let's look at them now

In [8]:
df.columns

Index(['OrderId', 'Product', 'Quantity', 'UnitPrice', 'Date', 'Address'], dtype='object')
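
The three renaming steps can also be chained. The sketch below reproduces them on an empty hypothetical frame (`toy`) that only carries the original column labels; `short_names` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Empty frame with the original labels (no data needed to test renaming)
toy = pd.DataFrame(columns=["Order ID", "Product", "Quantity Ordered",
                            "Price Each", "Order Date", "Purchase Address"])

# Same mapping as in the Rename step
short_names = {
    "Quantity Ordered": "Quantity",
    "Price Each": "unit price",
    "Order Date": "Date",
    "Purchase Address": "Address",
}

toy = toy.rename(columns=short_names)

# Title-case every label, then strip the spaces
toy.columns = toy.columns.str.title().str.replace(" ", "")

print(list(toy.columns))
```

Assigning the result back instead of using `inplace=True` keeps each step easy to inspect or undo.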

### Second: Cast data types

#### Convert ```"OrderId","Quantity","UnitPrice"``` columns to numeric values

In [9]:
numeric_cols = ["OrderId", "Quantity", "UnitPrice"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce", downcast="integer")
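
To illustrate what `errors="coerce"` does here, the sketch below runs `pd.to_numeric` on a small hypothetical series. The non-numeric entry is an assumed example of a bad value (such as a stray header row); the notebook does not show which values actually fail to parse.

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, one bad string, one missing
raw = pd.Series(["2", "1", "Quantity Ordered", None])

# Unparseable entries become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce", downcast="integer")

print(clean)
print(clean.dtype)
```

Because NaN cannot be stored in a NumPy integer column, the result stays `float64` even though `downcast="integer"` was requested.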

#### Convert ```"Date"``` column to datetime object

In [10]:
df["Date"] = pd.to_datetime(df["Date"],format = "%d/%m/%y %H:%M",errors="coerce")
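
With `errors="coerce"`, any string that does not match the given `format` silently becomes `NaT` and is removed by a later `dropna`. It may be worth checking how many values are lost. The sketch below uses two hypothetical date strings; whether the real file is day-first or month-first is not stated in the notebook, so both formats are shown.

```python
import pandas as pd

# Hypothetical date strings; "04/19/19" is only valid as month/day/year
raw = pd.Series(["04/19/19 08:46", "01/02/19 10:00"])

# Day-first format: the first string has no valid month 19, so it becomes NaT
day_first = pd.to_datetime(raw, format="%d/%m/%y %H:%M", errors="coerce")

# Month-first format: both strings parse
month_first = pd.to_datetime(raw, format="%m/%d/%y %H:%M", errors="coerce")

print(day_first.isna().sum())
print(month_first.isna().sum())
```

Comparing the `NaT` count against the number of rows that were already missing gives a quick signal of whether the chosen format matches the data.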

#### Drop NaN values

In [11]:
df.dropna(inplace=True)

#### Convert ```"OrderId","Quantity"``` columns to integer
These columns are **float** because they contained **NaN values**, which an integer column cannot hold. With those rows dropped in the previous step, the columns can now be converted to **integer**.

In [12]:
df[["OrderId","Quantity"]].dtypes

OrderId     float64
Quantity    float64
dtype: object

In [13]:
convert_dict = {'OrderId': int,
                'Quantity': int
               }
  
df = df.astype(convert_dict)
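
The order of these steps matters. A minimal sketch on a hypothetical series (`with_gaps`) shows that casting to `int` fails while NaN is present and succeeds once it is dropped. The nullable `"Int64"` dtype is shown as an alternative that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical float column with one missing value
with_gaps = pd.Series([1.0, 2.0, float("nan")])

# Casting to int while NaN is present raises an error
try:
    with_gaps.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# After dropping NaN, the cast works
as_int = with_gaps.dropna().astype(int)

# Alternative: pandas' nullable integer dtype keeps the missing value as <NA>
as_nullable = with_gaps.astype("Int64")

print(cast_failed)
print(as_int.dtype)
print(as_nullable)
```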

### Third: Set the ```"Date"``` column as index

In [14]:
df.set_index("Date",inplace=True)
df.sort_index(inplace=True) #sort rows by their date

In [15]:
df.head()

| Date | OrderId | Product | Quantity | UnitPrice | Address |
|---|---|---|---|---|---|
| 2019-01-01 03:07:00 | 147268 | Wired Headphones | 1 | 11.99 | 9 Lake St, New York City, NY 10001 |
| 2019-01-01 03:40:00 | 148041 | USB-C Charging Cable | 1 | 11.95 | 760 Church St, San Francisco, CA 94016 |
| 2019-01-01 04:56:00 | 149343 | Apple Airpods Headphones | 1 | 150.0 | 735 5th St, New York City, NY 10001 |
| 2019-01-01 05:53:00 | 149964 | AAA Batteries (4-pack) | 1 | 2.99 | 75 Jackson St, Dallas, TX 75001 |
| 2019-01-01 06:03:00 | 149350 | USB-C Charging Cable | 2 | 11.95 | 943 2nd St, Atlanta, GA 30301 |
