# Exploratory Data Analysis<a id='Exploratory_Data_Analysis'></a>

## 1.1 Contents<a id='3.1_Contents'></a>
* [1 Exploratory Data Analysis](#Exploratory_Data_Analysis)
  * [1.1 Imports](#1.1_Imports)
  * [1.2 Load The Data](#1.2_Load_The_Data)
  * [1.3 Explore The Data](#1.3_Explore_The_Data)
    * [1.3.1 Visualizing High Dimensional Data](#1.3.1_Visualizing_High_Dimensional_Data)
      * [1.3.1.1 Average Door Swings Per State](#1.3.1.1_Average_Door_Swings_Per_State)

### 1.1 Imports

In [42]:
import numpy as np
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

### 1.2 Load The Data

In [43]:
df = pd.read_excel(r'C:\Users\asiu200\OneDrive - Comcast\Python\Springboard\Data Wrangling.xlsx')

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74789 entries, 0 to 74788
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Traffic_Date      74789 non-null  datetime64[ns]
 1   STORE_NAME        74789 non-null  object        
 2   STORE_CITY_NAME   74789 non-null  object        
 3   STORE_STATE_CODE  74789 non-null  object        
 4   Door_Swings       74789 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 2.9+ MB


In [45]:
df.rename(columns={'Traffic_Date': 'date', 'STORE_NAME': 'store_name',
                   'STORE_CITY_NAME': 'city', 'STORE_STATE_CODE': 'state',
                   'Door_Swings': 'door_swings'}, inplace=True)

In [46]:
df.head()

| | date | store_name | city | state | door_swings |
|---|---|---|---|---|---|
| 0 | 2014-12-22 | 3351 - Albuquerque, NM (XF) | Albuquerque | NM | 656 |
| 1 | 2014-12-22 | 3352 - Lakewood, CO (XF) | Lakewood | CO | 452 |
| 2 | 2014-12-22 | 3353 - Colorado Springs, CO (XF) | Colorado Springs | CO | 562 |
| 3 | 2014-12-22 | 3354 - Thornton, CO (XF) | Thornton | CO | 594 |
| 4 | 2014-12-22 | 3356 - Boulder, CO (XF) | Boulder | CO | 369 |


### 1.3 Explore The Data

In [47]:
print(df.store_name.value_counts())

3353 - Colorado Springs, CO (XF)                                2891
3354 - Thornton, CO (XF)                                        2891
3352 - Lakewood, CO (XF)                                        2891
3456 - Layton2, UT (XF)                                         2887
3357 - Centennial, CO (XF)                                      2879
3453 - Orem, UT (XF)                                            2857
3351 - Albuquerque, NM (XF)                                     2841
3356 - Boulder, CO (XF)                                         2822
3455 - Draper, UT (XF)                                          2718
3359 - Loveland, CO (XF)                                        2489
3360 - Arvada, CO (XF)                                          2134
3457 - Salt Lake City, UT (XF)                                  2041
3361 - Longmont, CO (XF)                                        1904
3358 - Denver, CO (XF)                                          1829
3362 - Pueblo, CO (XF)            

Some stores have fewer data points than others, either because they opened after 01-01-2019 or because they have since closed.
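One way to check this explanation is to look at the first and last recorded date for each store next to its row count. The snippet below is a minimal sketch on a small made-up frame (`toy` and its rows are hypothetical stand-ins for `df`, and `coverage` is an illustrative name); the same `groupby` call should carry over to the real data, assuming the `date` and `store_name` columns as renamed above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the notebook's df (not the real data).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-22", "2014-12-23", "2014-12-24",
                            "2019-03-01", "2019-03-02"]),
    "store_name": ["A", "A", "A", "B", "B"],
    "door_swings": [656, 452, 562, 594, 369],
})

# Row count plus first and last observed date for each store,
# sorted so the stores with the fewest rows come first.
coverage = (
    toy.groupby("store_name")["date"]
       .agg(n_days="count", first_date="min", last_date="max")
       .sort_values("n_days")
)
print(coverage)
```

A store with a late `first_date` likely opened partway through the period, and one with an early `last_date` likely closed.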

In [48]:
count = df[['state', 'store_name']].nunique()
print(count)

state          4
store_name    50
dtype: int64


The data covers 50 stores across four states in our region.

In [49]:
print(df.groupby('state')['store_name'].nunique())

state
AZ     2
CO    29
NM     6
UT    13
Name: store_name, dtype: int64


Of the four states, CO has the most stores (29) and AZ has the fewest (2).
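The states with the most and fewest stores can also be read off programmatically instead of by eye. This is a small sketch on invented rows (`toy`, `most`, and `fewest` are illustrative names, not part of the notebook); it assumes the same `state` and `store_name` columns.

```python
import pandas as pd

# Hypothetical toy rows standing in for the notebook's df (not the real data).
toy = pd.DataFrame({
    "state": ["CO", "CO", "CO", "NM", "NM", "AZ", "CO"],
    "store_name": ["s1", "s2", "s1", "s3", "s4", "s5", "s6"],
})

# Number of distinct stores per state.
stores_per_state = toy.groupby("state")["store_name"].nunique()

# Index labels (state codes) of the largest and smallest counts.
most = stores_per_state.idxmax()
fewest = stores_per_state.idxmin()
print(most, fewest)
```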

#### 1.3.1 Visualizing High Dimensional Data

##### 1.3.1.1 Average Door Swings Per State

In [50]:
state_avg_ds = df.groupby('state')['door_swings'].mean()
state_avg_ds.head()

state
AZ    190.512353
CO    260.751677
NM    377.025080
UT    202.568478
Name: door_swings, dtype: float64

In [51]:
df.sort_values(['door_swings'],ascending=False).head(20)

| | date | store_name | city | state | door_swings |
|---|---|---|---|---|---|
| 9577 | 2017-08-11 | 3360 - Arvada, CO (XF) | Arvada | CO | 1683 |
| 9238 | 2017-07-16 | 3354 - Thornton, CO (XF) | Thornton | CO | 1582 |
| 83 | 2014-12-31 | 3353 - Colorado Springs, CO (XF) | Colorado Springs | CO | 1509 |
| 38246 | 2020-09-03 | 3351 - Albuquerque, NM (XF) | Albuquerque | NM | 1291 |
| 16860 | 2018-10-27 | 3365 - Gardens on Havana - Aurora, CO (XF) | Aurora | CO | 1274 |
| 9842 | 2017-09-01 | 3351 - Albuquerque, NM (XF) | Albuquerque | NM | 1272 |
| 18287 | 2018-12-29 | 3360 - Arvada, CO (XF) | Arvada | CO | 1249 |
| 39374 | 2020-10-01 | 3351 - Albuquerque, NM (XF) | Albuquerque | NM | 1234 |
| 22417 | 2019-06-03 | 3365 - Gardens on Havana - Aurora, CO (XF) | Aurora | CO | 1224 |
| 38207 | 2020-09-02 | 3351 - Albuquerque, NM (XF) | Albuquerque | NM | 1218 |


Notably, NM has the highest average door swings (about 377), and the Albuquerque store in NM accounts for 10 of the 20 highest door-swing counts in the past eight years.
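The share of top readings held by a single store can be counted directly instead of by scanning the sorted table. The sketch below uses a made-up frame and a smaller cutoff (`toy`, `top_n`, and `top_counts` are hypothetical names); on the real `df` the cutoff would presumably be 20.

```python
import pandas as pd

# Hypothetical toy rows standing in for the notebook's df (not the real data).
toy = pd.DataFrame({
    "store_name": ["A", "B", "A", "C", "A", "B"],
    "door_swings": [1683, 1582, 1509, 1291, 1274, 400],
})

top_n = 5  # assumption for the toy data; the notebook looks at the top 20

# Keep the top_n rows by door_swings, then count how often each store appears.
top = toy.nlargest(top_n, "door_swings")
top_counts = top["store_name"].value_counts()
print(top_counts)
```

`nlargest` avoids sorting the whole frame when only the top rows are needed, and `value_counts` returns the per-store tally in descending order.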