# The Business Problem
Pawdacity is a leading pet store chain in Wyoming with 13 stores throughout the state. This year, Pawdacity would like to expand and open a 14th store. Your manager has asked you to perform an analysis to recommend the city for Pawdacity’s newest store, based on predicted yearly sales.

Your first step in predicting yearly sales is to format and blend data from the different datasets and deal with outliers.

Your manager has given you the following information to work with:

- The monthly sales data for all of the Pawdacity stores for the year 2010.
- NAICS data on the most recent sales of all competitor stores, where total sales equals 12 months of sales.
- A partially parsed data file that can be used for population numbers.
- Demographic data (Households with individuals under 18, Land Area, Population Density, and Total Families) for each city and county in the state of Wyoming. For readers unfamiliar with the US city system: a state contains counties, and each county contains one or more cities.

In [None]:
# load packages
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 7]

## Step 1: Business and Data Understanding
Your project should include a description of the key business decisions that need to be made.

1. What decisions need to be made?

Recommend the city for Pawdacity’s newest (14th) store, based on predicted yearly sales.

2. What data is needed to inform those decisions?

- Pawdacity Monthly Sales
- Population Data
- Demographic Data
- Competitor Sales

## Step 2: Building the Training Set
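To properly build the model and select predictor variables, create a dataset with the following columns:

- City
- Total Pawdacity Sales
- Census 2010
- Land Area
- Households with Under 18
- Population Density
- Total Families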

In [None]:
# load data
monthly_sales = pd.read_csv('p2-2010-pawdacity-monthly-sales-p2-2010-pawdacity-monthly-sales.csv')
population_data = pd.read_csv('p2-partially-parsed-wy-web-scrape.csv')
demographic_data = pd.read_csv('p2-wy-demographic-data.csv')
competitor_data = pd.read_csv('p2-wy-453910-naics-data.csv')


In [None]:
print('monthly_sales')
monthly_sales

In [None]:
print('population_data')
population_data

In [None]:
print('demographic_data')
demographic_data

In [None]:
print('competitor_data')
competitor_data

In [None]:
# Clean the partially parsed population data
file_name = 'p2-partially-parsed-wy-web-scrape.csv'
with open(file_name, encoding='utf8') as f:
    lines = f.readlines()

# lines[0] is the header: City|County,2014 Estimate,2010 Census,2000 Census
content = []

# Only the first 99 data rows are parsed
for data_str in lines[1:100]:
    # The first comma separates "City|County" from the HTML table cells
    split_index = data_str.index(',')
    city, county = data_str[:split_index].replace('?', '').split('|')
    row = [city.strip(), county.strip()]
    other_data = data_str[split_index + 1:].replace('"', '')
    # Name the parser explicitly so the result does not depend on which backends are installed
    for td in BeautifulSoup(other_data, 'html.parser').find_all('td'):
        value = td.text.replace(',', '')  # remove thousands separators
        value = value.replace('-', '')  # remove dashes
        value = value.split('[')[0]  # remove footnote markers such as [4] left by sup tags
        value = float(value) if value else None
        row.append(value)
    content.append(row)

columns = ['City', 'County', 'Estimate 2014', 'Census 2010', 'Census 2000']
population_data = pd.DataFrame(content, columns=columns)
population_data.info()
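
The per-cell cleanup inside the loop above can be pulled out into a small helper so it can be checked on its own. This is only a sketch: the name `clean_cell` and the sample strings are hypothetical, and it assumes cells look like `12,345[4]` or a lone dash for missing values.

```python
def clean_cell(text):
    """Convert a scraped table cell such as '12,345[4]' to a float, or None if empty."""
    value = text.replace(',', '')   # drop thousands separators
    value = value.replace('-', '')  # drop dashes
    value = value.split('[')[0]     # drop footnote markers such as [4]
    value = value.strip()           # drop surrounding whitespace
    return float(value) if value else None


# Hypothetical sample cells, not taken from the real file
print(clean_cell('12,345[4]'))
print(clean_cell('-'))
```

Keeping the conversion in one function makes it easier to spot cells that do not fit the assumed pattern, since `float` raises a `ValueError` on anything unexpected.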

In [None]:
# Add total Pawdacity sales by summing the 12 monthly columns (January to December)
month_columns = list(monthly_sales.columns)[5:17]
monthly_sales['Total Pawdacity Sales'] = monthly_sales[month_columns].sum(axis=1)
# Drop the monthly columns and rename CITY to City so it matches the other datasets
clean_data = monthly_sales.drop(month_columns, axis=1).rename(columns={'CITY': 'City'})

In [None]:
# Merge Population data to Clean sales data
clean_data = clean_data.merge(population_data[['City', 'Census 2010']], on='City', how='left')
clean_data

In [None]:
# Merge Demographic data to Clean sales data
clean_data = clean_data.merge(demographic_data, on='City', how='left')
clean_data
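
A left merge keeps every sales row but silently fills `NaN` wherever a city name has no exact match, for example because of stray whitespace. One way to check for this, sketched here on hypothetical toy frames (the city names and numbers are made up), is the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column describing where each row came from.

```python
import pandas as pd

# Hypothetical toy data, not the Pawdacity files
sales = pd.DataFrame({'City': ['Alpha', 'Beta', 'Gamma'],
                      'Total Pawdacity Sales': [100, 200, 300]})
population = pd.DataFrame({'City': ['Alpha', 'Beta '],  # note the trailing space
                           'Census 2010': [1000, 2000]})

# Normalize the join key before merging
population['City'] = population['City'].str.strip()

merged = sales.merge(population, on='City', how='left', indicator=True)

# Rows marked 'left_only' found no partner in the right-hand frame
unmatched = merged.loc[merged['_merge'] == 'left_only', 'City'].tolist()
print(unmatched)
```

An empty `unmatched` list would suggest that every store city picked up a population value; any names listed need their keys reconciled before modeling.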

In [None]:
# Summarize the numerical variables by column total
numerical_columns = ['Total Pawdacity Sales', 'Census 2010', 'Land Area', 'Households with Under 18', 'Population Density', 'Total Families']
summary = clean_data[numerical_columns].sum()
summary
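
As an alternative to listing the numerical columns by hand, pandas can select them by dtype. The sketch below uses a hypothetical toy frame (names and values are made up) and `DataFrame.select_dtypes` to keep only numeric columns before summing.

```python
import pandas as pd

# Hypothetical toy data, not the Pawdacity files
toy = pd.DataFrame({'City': ['Alpha', 'Beta'],
                    'Total Pawdacity Sales': [100, 200],
                    'Land Area': [1.5, 2.5]})

# Keep numeric columns only, then total each one
toy_summary = toy.select_dtypes(include='number').sum()
print(toy_summary)
```

This avoids summing a text column such as `City` by accident, at the cost of also including any numeric identifier columns that happen to be present.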

## Step 3: Dealing with Outliers
Once you have created the dataset, look for outliers and decide how to deal with them. Use the IQR method to determine whether there are outlier cities for each of the variables, then justify which city with at least one outlier value should be removed.
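
A minimal sketch of the IQR method follows. A value is flagged when it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1. The function name `iqr_outliers`, the 1.5 multiplier (the conventional choice; the project text does not state one), and the toy data are assumptions for illustration only.

```python
import pandas as pd


def iqr_outliers(df, columns, k=1.5):
    """Return a boolean frame: True where a value lies outside the IQR fences."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own fences
    return df[columns].lt(lower, axis=1) | df[columns].gt(upper, axis=1)


# Hypothetical toy data, not the Pawdacity files
toy = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D', 'E'],
    'Sales': [10, 12, 11, 13, 100],
    'Land Area': [5.0, 6.0, 5.5, 6.5, 6.0],
}).set_index('City')

flags = iqr_outliers(toy, ['Sales', 'Land Area'])
outlier_counts = flags.sum(axis=1)  # number of flagged variables per city
print(outlier_counts[outlier_counts > 0])
```

Counting flags per city, rather than only listing them per variable, helps with the justification step: a city flagged on several variables is a different case from one flagged on a single variable.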