# COGS 108 - Final Project

## Important

- ONE, and only one, member of your group should upload this notebook to TritonED. 
- Each member of the group will receive the same grade on this assignment. 
- Keep the file name the same: submit the file 'FinalProject.ipynb'.
- Only upload the .ipynb file to TED; do not upload any associated data. Make sure that any cells whose output you want graders to see have been executed.

## Group Members: Fill in the Student IDs of each group member here

Replace the lines below to list each person's full student ID, UCSD email, and full name.

- A12814729, vkhua@ucsd.edu, Vivian Hua
- A11983710, css003@ucsd.edu, Cody Sophearum Smith
- U07979059, ddjiang@ucsd.edu, Derek De Gui Jiang
- A13497348, allacuna@ucsd.edu, Allyson Llacuna
- A12433857, Jor029@ucsd.edu, Joel Ramirez
- A11774341, dkovelma@ucsd.edu, Daniel Kovelman-Ottilie



Start your project here.

## Introduction and Background

The purpose of this project is to determine how different regional factors influence the unemployment rate and to build a model that predicts the unemployment rate from strongly correlated factors. All correlations used to form the prediction equation are derived from a dataset containing information from several Wal-Mart stores within the United States. We use Wal-Mart data specifically because Wal-Mart, as the "largest private sector employer in the US," is often a "lightning rod for criticism" concerning issues of worker compensation, pay, and employment (1). Research has also shown that Wal-Mart is large enough in the United States that its arrival in an area can have implications for regional unemployment rates (2). Therefore, any correlations we find might be applicable and useful to other corporations that want to open a store in a specific area.


Based upon previous findings on the relationship between the unemployment rate and inflation (3) and changes in gas prices (4), we hypothesized that consumer price index (CPI), fuel price, and weekly sales would correlate strongly with the unemployment rate, with negative, positive, and negative correlations respectively. These are therefore the factors to be included in the overall unemployment rate prediction equation. In addition, we hypothesized that other factors, such as average regional temperature and holidays, would correlate only weakly with the unemployment rate over time and would be dropped from the prediction equation due to negligible effects on unemployment.

References (include links):
- 1) Bhattarai, Abha, and Todd C. Frankel. "Walmart Said It's Giving Its Employees A Raise. And Then It Closed 63 Stores." The Washington Post, 11 Jan. 2018, https://www.washingtonpost.com/news/business/wp/2018/01/11/walmart-to-raise-starting-hourly-wage-to-11-offer-paid-parental-leave/?fbclid=IwAR1L5UfHBA78aJ-vkee9tJZsBCpDlQyedVMNBrzrwdORjHCS-rz_LP1KMKk&noredirect=on&utm_term=.d01ffabc9bfa

- 2) Keil, Stanley R. and Lee C. Spector. "The Impact of Wal-Mart On Income And Unemployment Differentials In Alabama." The Review of Regional Studies 35.3 (2005): 336-355. Web. 16 Feb. 2019 https://pdfs.semanticscholar.org/1650/e455a3185c4b257b7b30c971c1606562ac31.pdf?fbclid=IwAR0ECEpANJRWRqDiftbjiqrdp_FseDbt2LIY4uzjM5pzgfi8TF-kwiGBPdk 

- 3) Picardo, Elvis. "How Inflation and Unemployment Rate Are Related." Investopedia, 11 May 2018, https://www.investopedia.com/articles/markets/081515/how-inflation-and-unemployment-are-related.asp

- 4) "The Price Of Gas And The Unemployment Rate." Seeking Alpha, 14 Feb. 2011, https://seekingalpha.com/article/252704-the-price-of-gas-and-the-unemployment-rate?page=2&fbclid=IwAR2t9FILcjgUuubDZ8fAUdEhVRoc0ILlh35LnpJ4vRC7b5ar8e796JMWzMk

## Data Description

For this project, we used datasets from a Kaggle competition (https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data). Of the datasets available, we used two: features.csv and train.csv. The data span about 3 years of observations and are entirely anonymized.

features.csv details:
features.csv contains 8,190 observations with the following columns: store number, date, temperature, fuel price, Markdown 1-5, CPI, unemployment, and IsHoliday. The store and date columns are strings, IsHoliday is a boolean, and the rest are floats.

train.csv details:
train.csv contains 421,570 observations with the following columns: store number, date, department, weekly sales, and IsHoliday. Store, date, and department are strings, IsHoliday is a boolean, and weekly sales is a float.

Fuel price is in dollars, CPI is a ratio, temperature is in Fahrenheit, and unemployment is a percentage. 

For the analysis we performed, Markdown 1-5 and IsHoliday were not used, although IsHoliday was still retained in the dataset to aid in considering whether to remove outliers or not during the cleaning process. 

In [1]:
#The imports for this project
import pandas as pd
import numpy as np
import seaborn as sp
import matplotlib as mplot
import matplotlib.pyplot  #makes sure mplot.pyplot is loaded for the plots below
from collections import defaultdict
import patsy
import statsmodels.api as sm
from scipy import stats
from scipy.stats import ttest_ind, chisquare, normaltest
from sklearn import preprocessing

## Data Cleaning

A few steps were needed to make the data usable for the types of analysis we wanted. Because the datasets were created for a different purpose, not all of the data is relevant, some observations are missing key values, and some columns are in a datatype not conducive to our analysis.
To clean the data, we first removed columns not relevant to our analysis: the columns named Markdown 1-5 in features.csv were not used. To aid the analysis, we wanted all the relevant data in one dataset, including the weekly sales information from train.csv. The issue with train.csv was that its observations are split not only across store and date, as in features.csv, but also across department. To add them to features, we first summed the weekly sales over all departments for the same store and date, then merged that total weekly sales value into the features dataset on store and date. Once that was accomplished, we converted the "Date" column from a string to an integer for plotting and analysis convenience. Lastly, we removed rows with missing unemployment data and checked for outliers.
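
The aggregation and merge described above can also be expressed with pandas `groupby`. The following is a minimal sketch on small made-up tables (the frames `train` and `feats` and all of their values are hypothetical stand-ins, not the Kaggle files); it is meant to illustrate the steps, not to replace the cells below.

```python
import pandas as pd

# Hypothetical miniature versions of train.csv and features.csv
train = pd.DataFrame({
    'Store': [1, 1, 1, 2],
    'Dept': [1, 2, 1, 1],
    'Date': ['2010-02-05', '2010-02-05', '2010-02-12', '2010-02-05'],
    'Weekly_Sales': [100.0, 50.0, 70.0, 30.0],
})
feats = pd.DataFrame({
    'Store': [1, 1, 2, 2],
    'Date': ['2010-02-05', '2010-02-12', '2010-02-05', '2010-02-12'],
    'Unemployment': [8.1, 8.1, 8.3, None],
})

# Sum sales over departments for each (store, date) pair
store_sales = train.groupby(['Store', 'Date'], as_index=False)['Weekly_Sales'].sum()

# Store dates as integers such as 20100205
for frame in (store_sales, feats):
    frame['Date'] = frame['Date'].str.replace('-', '', regex=False).astype(int)

# Combine on store and date, then drop rows without unemployment data
merged = pd.merge(store_sales, feats, on=['Store', 'Date'], how='outer')
merged = merged.dropna(subset=['Unemployment'])
print(merged)
```

On a table the size of train.csv, a grouped sum like this is typically much faster than iterating over rows, although both approaches should produce the same totals.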

In [2]:
weeksalesdb = pd.read_csv('train.csv')
features = pd.read_csv('features.csv')

In [3]:
features.drop(columns = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5'], inplace = True)

In [4]:
#converts weeksalesdb rows into dict and merging departments
agg_sales = defaultdict(int)
for ind, sale in weeksalesdb.iterrows():
    agg_sales[str(sale['Store'])+'.'+sale['Date']] += sale['Weekly_Sales']

In [None]:
#converts back into db with updated sales value
storesales = pd.DataFrame(columns=['Store','Date','Weekly_Sales'])

for key, value in agg_sales.items():
    store, date = key.split('.')
    newrow = [store, date, value]
    storesales.loc[len(storesales)] = newrow

In [None]:
#function to convert date to integer
def convert_date(date):
    #e.g. '2010-02-05' -> 20100205
    return int(date.strip().replace('-', ''))

#convert date columns
storesales['Date'] = storesales['Date'].apply(convert_date)
features['Date'] = features['Date'].apply(convert_date)

In [None]:
#convert store column to int (was str)
storesales['Store'] = pd.to_numeric(storesales['Store'])
features.head(10)

In [None]:
#exported csv files
storesales.to_csv('storesales.csv')
features.to_csv('features2.csv')

In [None]:
#merged two dataframes together on the store and date column
merged = pd.merge(storesales, features, on =['Store', 'Date'], how = 'outer')

In [None]:
#removed rows where unemployment data was empty
merged.dropna(subset = ['Unemployment'], inplace = True)

In [None]:
#export cleaned features
#this file is read back in at the start of the visualization section
merged.to_csv('features_clean.csv', index = False)

In [None]:
merged.head(10)

In [None]:
#checked for outliers in the weekly sales column, most of these are on or near holidays.
outliers = merged[merged['Weekly_Sales'] > merged['Weekly_Sales'].mean() + 3 * merged['Weekly_Sales'].std()]
outliers

In [None]:
features = pd.read_csv('features_clean.csv')

## Data Visualization

In [None]:
features.plot.scatter(x='Temperature', y='Unemployment')

While we initially hypothesized that temperature would not be related to unemployment, it is interesting to note that clusters of higher unemployment appear at the higher end of the temperature range that are not present at lower temperatures.

In [None]:
features['Temperature'].plot.hist()
mplot.pyplot.xlabel('Temperature')
mplot.pyplot.ylabel('Frequency')

There was a concern that temperature's distribution might be problematic for analysis, but it appears approximately normally distributed.
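
A visual impression like this one could be complemented with a formal test. The sketch below uses `scipy.stats.normaltest` (already imported at the top of the notebook) on synthetic samples; `bell_shaped` and `two_peaks` are made-up arrays, not the Walmart columns, so the printed numbers say nothing about the real data.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly bell-shaped sample, one with two separate peaks
bell_shaped = rng.normal(loc=60, scale=18, size=2000)
two_peaks = np.concatenate([rng.normal(130, 3, 1000), rng.normal(215, 3, 1000)])

for name, sample in [('bell_shaped', bell_shaped), ('two_peaks', two_peaks)]:
    # normaltest combines skewness and kurtosis; a small p-value suggests non-normality
    stat, p = normaltest(sample)
    print(f'{name}: statistic={stat:.2f}, p-value={p:.4g}')
```

With several thousand rows, tests of this kind tend to flag even small departures from normality, so the histogram remains a useful check alongside the p-value.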

In [None]:
features.plot.scatter(x='CPI', y='Unemployment')

CPI appears to have a bimodal distribution in the dataset, which is further supported in the histogram below.

In [None]:
features['CPI'].plot.hist()
mplot.pyplot.xlabel('CPI')
mplot.pyplot.ylabel('Frequency')

In [None]:
features.plot.scatter(x='Weekly_Sales', y='Unemployment')
# Convert large values into log base 2 for easier data visualization
# Weekly_sales is measured in dollars
mplot.pyplot.xscale('log', base=2)  #the keyword was 'basex' in matplotlib < 3.3

Weekly sales have a huge range of values, so this graph's x-axis is on a log base 2 scale. The interesting part of this visualization is that the extreme high end of weekly sales values has lower unemployment rates.

In [None]:
features['Weekly_Sales'].plot.hist()
mplot.pyplot.xlabel('Weekly_Sales')
mplot.pyplot.ylabel('Frequency')
mplot.pyplot.xscale('log', base=2)  #the keyword was 'basex' in matplotlib < 3.3

There are a few very extreme sales values, but when we examined the outliers during data cleaning, these dates tended to coincide with major holidays, so they are arguably still representative of important data.
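
One way to check how outliers line up with holidays is to cross-tabulate an outlier flag against `IsHoliday`. This is a minimal sketch on a made-up table (`weeks` and its values are hypothetical); it applies the same rule as the cleaning section, flagging values more than 3 standard deviations above the mean.

```python
import pandas as pd

# Hypothetical weekly totals: 19 ordinary weeks and one holiday week (values are made up)
weeks = pd.DataFrame({
    'Weekly_Sales': [1.0e6] * 19 + [3.0e6],
    'IsHoliday': [False] * 19 + [True],
})

# Flag sales more than 3 standard deviations above the mean
threshold = weeks['Weekly_Sales'].mean() + 3 * weeks['Weekly_Sales'].std()
weeks['IsOutlier'] = weeks['Weekly_Sales'] > threshold

# Count how many outliers fall on holiday weeks
print(pd.crosstab(weeks['IsOutlier'], weeks['IsHoliday']))
```

Note that `IsHoliday` only marks the holiday weeks themselves, so outliers that fall near a holiday rather than on it would still need to be checked by date.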

In [None]:
features.plot.scatter(x='Fuel_Price', y='Unemployment')

In the above graph, the data are widely spread, and fuel price appears to be only weakly correlated with unemployment.
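
A visual judgment of a weak correlation can be given a number with the Pearson correlation coefficient. The sketch below uses synthetic columns (the `demo` frame and its generating formula are assumptions made for illustration), so the value printed is not a result about the Walmart data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in columns; the real values come from features_clean.csv
fuel = rng.uniform(2.5, 4.5, size=500)
unemp = 8.0 - 0.1 * fuel + rng.normal(0, 1.5, size=500)
demo = pd.DataFrame({'Fuel_Price': fuel, 'Unemployment': unemp})

# Series.corr computes the Pearson correlation by default
r = demo['Fuel_Price'].corr(demo['Unemployment'])
print(f'Pearson r = {r:.3f}, r squared = {r**2:.3f}')
```

For a regression with a single predictor, the squared correlation equals the R-squared that the OLS summary reports.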

In [None]:
features['Fuel_Price'].plot.hist()
mplot.pyplot.xlabel('Fuel_Price')
mplot.pyplot.ylabel('Frequency')

## Data Analysis and Results

<font size="4">Here, we will be analyzing how much each factor influences unemployment before developing a model.</font>

In [None]:
out_sales, pred_sales = patsy.dmatrices('Unemployment ~ Weekly_Sales', features)
mod_sales = sm.OLS(out_sales, pred_sales)
res_sales = mod_sales.fit()

out_temp, pred_temp = patsy.dmatrices('Unemployment ~ Temperature', features)
mod_temp = sm.OLS(out_temp, pred_temp)
res_temp = mod_temp.fit()

out_fuel, pred_fuel = patsy.dmatrices('Unemployment ~ Fuel_Price', features)
mod_fuel = sm.OLS(out_fuel, pred_fuel)
res_fuel = mod_fuel.fit()

out_cpi, pred_cpi = patsy.dmatrices('Unemployment ~ CPI', features)
mod_cpi = sm.OLS(out_cpi, pred_cpi)
res_cpi = mod_cpi.fit()

In [None]:
print(res_sales.summary(), res_temp.summary(), res_fuel.summary(), res_cpi.summary())

<font size="4">After finding which features were most correlated, we attempted to build a model to predict unemployment from a combination of features, first using the ones that appeared most relevant.</font>


In [None]:
out1, pred1 = patsy.dmatrices('Unemployment ~ Weekly_Sales + CPI', features)
mod1 = sm.OLS(out1, pred1)
res1 = mod1.fit()

In [None]:
print(res1.summary())

<font size="4">To check what happens when all the factors are incorporated, we made an additional model with all the features.</font>

In [None]:
out2, pred2 = patsy.dmatrices('Unemployment ~ Weekly_Sales + CPI + Fuel_Price + Temperature', features)
mod2 = sm.OLS(out2, pred2)
res2 = mod2.fit()

In [None]:
print(res2.summary())

<font size="4">The predictive power of the model went up when the additional features were incorporated, and the summary shows all factors to be significant (P>|t|). To further investigate the relationship between the features and unemployment, we standardize the data below.</font>
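
Standardizing a column means subtracting its mean and dividing by its standard deviation. One detail to be aware of is which standard deviation is used: pandas' `Series.std()` defaults to the sample version (`ddof=1`), while scikit-learn's `StandardScaler` uses the population version (`ddof=0`). The sketch below, on a small made-up series, shows that the two differ only by a constant factor.

```python
import numpy as np
import pandas as pd

# Small made-up series with mean 5 and population standard deviation 2
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z-scores with the sample standard deviation (pandas default, ddof=1)
z_sample = (values - values.mean()) / values.std()
# z-scores with the population standard deviation (ddof=0)
z_population = (values - values.mean()) / values.std(ddof=0)

# The two versions differ by the constant factor sqrt(n / (n - 1))
n = len(values)
ratio = values.std() / values.std(ddof=0)
print(ratio, np.sqrt(n / (n - 1)))
```

For a dataset with thousands of rows the factor is very close to 1, so the choice has little practical effect on the regression coefficients.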

In [None]:
def standardizevalue(df, label):
    #z-score of one column: subtract the mean, divide by the (sample) standard deviation
    series = df.loc[:, label]
    avg = series.mean()
    stdv = series.std()
    series_standardized = (series - avg)/stdv
    return series_standardized

In [None]:
#columns we want to standardize
numericcolumns = features[['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']]
#get the column names
names = numericcolumns.columns
#create scaler
scaler = preprocessing.StandardScaler()
#apply transformation
normaled = scaler.fit_transform(numericcolumns)
normaled = pd.DataFrame(normaled, columns=names)
#delete the columns to be replaced with new values
features_normal = features.drop(labels = names, axis = 'columns')
#add in the columns from the normalized df
features_normal[names] = normaled
#rearrange columns to be like original features
features_normal = features_normal[features.columns]
#export csv file
features_normal.to_csv('features_normal.csv', index = False)

features_normal

In [None]:
out3, pred3 = patsy.dmatrices('Unemployment ~ Weekly_Sales + CPI + Fuel_Price + Temperature', features_normal)
mod3 = sm.OLS(out3, pred3)
res3 = mod3.fit()

In [None]:
print(res3.summary())

<font size="4">With standardization, it is easier to see the impact each feature has on unemployment. CPI is the most strongly correlated, followed (surprisingly) by temperature, then weekly sales, and then fuel price.</font>
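
The reason standardized coefficients are comparable is that each one equals the raw coefficient multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation. The following sketch checks that identity with NumPy least squares on synthetic data; the variable names and generating values are assumptions chosen for illustration and are not estimates from the Walmart data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Synthetic predictors on very different scales (names are illustrative only)
sales = rng.normal(1.0e6, 2.0e5, n)
cpi = rng.normal(170.0, 40.0, n)
y = 12.0 - 1.0e-6 * sales - 0.01 * cpi + rng.normal(0, 1.0, n)

def ols(X, y):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def zscore(a):
    """Column-wise z-scores using the population standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

X = np.column_stack([sales, cpi])
raw = ols(X, y)                   # slopes in original units
std = ols(zscore(X), zscore(y))   # slopes in standard-deviation units

print('raw slopes:         ', raw[1:])
print('standardized slopes:', std[1:])
```

The raw slopes differ by orders of magnitude simply because the predictors are measured in different units, whereas the standardized slopes are on a common scale.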

## Privacy/Ethics Concerns

As mentioned before, the Walmart data was found on Kaggle, a public platform where datasets are accessible to anyone with an account and a link to the data. Since most of these datasets are used for competition purposes, the data was provided so that anyone could analyze it, meaning that the public was free to download it with no restrictions and no additional responsibilities. The owner of the data, however, was not identified on Kaggle, which makes its credibility questionable. Not knowing who provided the data or how they obtained it raises questions such as “how much of the data given was extracted from the original source?” and “how accurate is this information?”

Regarding the actual data, our choice of datasets has not changed since the proposal. The data itself is heavily anonymized, making it nearly impossible to determine the identities of any of the stores in question: the regions for each store are not specified, and the stores are not named anything beyond Store 1 through 45. The closest one could come to narrowing down the stores is to compare the temperature for each day for each store against weather reports for those days to determine an area the store could be in. Even if the identity of a store were found, there is no other data that could implicate anything about any individuals working or shopping there.

## Conclusions and Discussion

In the analysis, we found that all four of the main factors analyzed were significant in predicting unemployment. In line with our hypothesis, CPI and weekly sales had negative correlations with the unemployment rate. Fuel price also had a negative correlation, whereas we had hypothesized a positive one. We were also surprised to find that temperature had a significant correlation with unemployment in our model, and a positive one at that.

Given background research on how CPI and fuel price impact the unemployment rate, the results we found are aligned with these findings. The weekly sales result makes intuitive sense: an increase in weekly sales means busier times for the store and a higher need for workers, leading to a negative relationship with unemployment. The temperature result may have to do with seasonal changes in demand. In the data visualization section, temperature and unemployment do not initially appear to be strongly correlated, but there is a cluster of higher unemployment rates when the temperature is high, which may indicate that unemployment is lower during the colder seasons. This may be because major holidays that are important for retail stores, such as Christmas and Thanksgiving, fall during the colder times of the year.

Using the coefficients found in the model, we can see how much a change in a factor can influence unemployment. For example, a 10 million dollar increase in weekly sales would correspond to about a 4 percentage point decrease in the unemployment rate. Alternatively, a 1 percentage point decrease in unemployment would require an increase in weekly sales of more than 2.503 million dollars. Math shown below.


In [None]:
10000000*-3.995e-07 #10 million (dollars) times coefficient for weekly sales

In [None]:
-1/-3.995e-07 #reverse process to find how much change in dollars to get a 1% decrease in unemployment

While factors such as CPI or temperature cannot be readily altered by corporations to produce an effect, this information can be used to inform where, when, or how a company should expand its operations. The correlation between temperature and unemployment, as mentioned before, may be due to seasonal demand for work, which a corporation would need to consider when hiring at different times of year: the strength of labor supply and demand affects compensation, benefits, and other factors that influence the cost of operation. Regional unemployment itself can also be extremely important for companies to consider. Companies like Wal-Mart have faced pressure due to the low unemployment rate in recent years (5). Given that features such as temperature, CPI, fuel price, and weekly sales are correlated with unemployment in our model, data on those features for a region can help a company determine whether opening or maintaining operations in that region is worthwhile or too costly.

References
- 5) Taylor, Kate. "The Unemployment Rate Has Fallen To A 48-Year Low, And It's Terrifying News For Walmart, McDonald's, And JCPenney." Business Insider, 5 Oct. 2018, https://www.businessinsider.com/unemployment-rate-sparks-hiring-concerns-2018-10?fbclid=IwAR2PW52Drh-x60tGPTKNiZauPHMSwv9qjAQqdLWDlkW06uDcsgoxOUGvB3w