## Statistical Analysis

Now that the data is cleaned and I've done some preliminary analysis, it's time to enrich my exploration with inferential statistics. This stage of the analysis will include the following steps:
   1. Look for numerical variables in the dataset that are correlated.
   2. Run confidence intervals and significance tests on variables that are related to the project question.
   3. Re-evaluate the original project question based on statistical evidence.
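
As a rough sketch of what steps 1 and 2 could look like, here is a small example on synthetic data. The `toy` frame, its values, and the choice of a Welch's t-test are assumptions made for illustration; they are not results from the listings dataset and not necessarily the tests used later on.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in for the listings data (made-up values, not real results)
rng = np.random.default_rng(0)
n = 500
beds = rng.integers(1, 6, size=n).astype(float)
toy = pd.DataFrame({
    'beds': beds,
    'price_USD': 50 + 40 * beds + rng.normal(0, 30, size=n),
    'instant_bookable': rng.integers(0, 2, size=n),
})

# Step 1: pairwise Pearson correlations between the numerical columns
corr_matrix = toy.corr()

# Step 2a: 95% confidence interval for the mean price, using the t distribution
price = toy['price_USD']
ci_low, ci_high = st.t.interval(0.95, len(price) - 1,
                                loc=price.mean(), scale=st.sem(price))

# Step 2b: Welch's t-test comparing mean price between two groups
group_a = toy.loc[toy['instant_bookable'] == 1, 'price_USD']
group_b = toy.loc[toy['instant_bookable'] == 0, 'price_USD']
t_stat, p_value = st.ttest_ind(group_a, group_b, equal_var=False)

print(corr_matrix.round(2))
print(f"95% CI for mean price: ({ci_low:.1f}, {ci_high:.1f})")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

Step 3 is then a matter of reading these numbers against the project question: which relationships are strong enough, and certain enough, to keep pursuing.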

Once again, I will focus on the listings dataset in this notebook, because it holds all of the numerical features currently available. The reviews data, which contains the review text for different listings, will be interpreted using NLP later in the project, as will the other text data in the listings file.

In [1]:
# import relevant packages
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

# remove warnings
import warnings
warnings.filterwarnings("ignore")

# set seaborn theme
sns.set(style='ticks')

%matplotlib inline

In [2]:
# load listings dataset
df = pd.read_csv('/Users/limesncoconuts2/datasets/capstone_one/los_angeles/los-angeles_listings.csv')
df.head()

Unnamed: 0,access,accommodates,availability_30,availability_365,availability_60,availability_90,bathrooms,bed_type,bedrooms,beds,...,phone_verification,photographer_verification,reviews_verification,selfie_verification,sent_id_verification,sesame_verification,sesame_offline_verification,weibo_verification,work_email_verification,zhima_selfie_verification
0,,2.0,30.0,365.0,60.0,90.0,1.0,Real Bed,1.0,1.0,...,True,False,True,False,False,False,False,False,False,False
1,,1.0,30.0,365.0,60.0,90.0,1.0,Real Bed,1.0,1.0,...,True,False,False,False,False,False,False,False,False,False
2,,1.0,15.0,344.0,39.0,69.0,1.0,Couch,1.0,1.0,...,True,False,True,False,False,False,False,False,False,False
3,,2.0,18.0,261.0,32.0,62.0,1.0,Real Bed,1.0,1.0,...,True,False,False,False,False,False,False,False,False,False
4,,1.0,30.0,365.0,60.0,90.0,1.0,Futon,1.0,1.0,...,True,False,True,False,False,False,False,False,False,False


This statistical analysis will focus on numerical and categorical data. The text data will be turned into features later in the project.

In [3]:
# CSV files do not store dtypes, so re-cast the columns that were
# categorical in the data cleaning notebook
to_category = ['property_type','room_type','bed_type','cancellation_policy','market','neighbourhood','city','state','calendar_updated','host_neighbourhood']
for col_name in to_category:
    try:
        df[col_name] = df[col_name].astype('category')
    except KeyError:
        # column is not present in this file, so skip it
        pass

In [4]:
# turn boolean columns into 0 or 1 integers
for col in df:
    if df[col].dtype == 'bool':
        df[col] = df[col].astype('int64')
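
The same conversion can also be written without an explicit loop. This is only a sketch on a hypothetical three-row frame (`toy` and its values are made up), using `select_dtypes` to find the boolean columns and a single `astype` call to cast them:

```python
import pandas as pd

# Hypothetical miniature of the listings frame, for illustration only
toy = pd.DataFrame({
    'phone_verification': [True, False, True],
    'beds': [1.0, 2.0, 1.0],
    'bed_type': ['Real Bed', 'Couch', 'Futon'],
})

# Find every boolean column, then cast them all to int64 in one call
bool_cols = toy.select_dtypes(include='bool').columns
toy = toy.astype({col: 'int64' for col in bool_cols})

print(toy.dtypes)
```

One caveat applies to both versions: a True/False column that also contains missing values is usually read from CSV as `object` rather than `bool`, so it would not be picked up by either approach.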

In [6]:
# cut down dataframe to include only numerical and categorical data
df = df.select_dtypes(include=['number', 'datetime', 'category'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599177 entries, 0 to 599176
Data columns (total 59 columns):
accommodates                        599177 non-null float64
availability_30                     599163 non-null float64
availability_365                    599163 non-null float64
availability_60                     599163 non-null float64
availability_90                     599163 non-null float64
bathrooms                           599177 non-null float64
bed_type                            599177 non-null category
bedrooms                            598639 non-null float64
beds                                598840 non-null float64
calculated_host_listings_count      599177 non-null float64
calendar_updated                    599177 non-null category
cancellation_policy                 585132 non-null category
city                                599088 non-null category
cleaning_fee_USD                    474852 non-null float64
extra_people_USD                    599177 no

In [None]:
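# pair plot of a small subset of columns; more candidates are commented out
cols = ['bathrooms', 'beds', 'city', 'number_of_reviews']  # 'price_USD', 'property_type',
       # 'review_scores_rating', 'reviews_per_month', 'room_type', 'square_feet', 'zipcode']
# pairplot only draws numeric columns, so the categorical 'city' column is skipped
sns.pairplot(df[cols])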

In [None]:
# heat map of pairwise correlations - column names have to be unique in the pandas index
cols = ['bathrooms', 'beds', 'cancellation_policy', 'city', 'instant_bookable',
        'host_is_superhost', 'is_business_travel_ready', 'number_of_reviews', 'price_USD', 'property_type',
        'review_scores_rating', 'reviews_per_month', 'room_type', 'square_feet',
        'zipcode']

# correlations are only defined for numeric columns, so drop the categorical ones first
sns.heatmap(df[cols].select_dtypes(include='number').corr(), square=True, cmap='RdYlGn')
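
A heat map shows how strong each correlation is, but not whether it is distinguishable from zero. As a minimal sketch of that follow-up check, here is `scipy.stats.pearsonr` applied to one pair of variables. The data is synthetic and only borrows the column names `beds` and `price` for readability, so the numbers say nothing about the real listings.

```python
import numpy as np
import scipy.stats as st

# Synthetic data (made-up relationship between beds and price)
rng = np.random.default_rng(1)
beds = rng.integers(1, 6, size=300).astype(float)
price = 50 + 40 * beds + rng.normal(0, 30, size=300)

# Pearson correlation coefficient and the two-sided p-value for H0: r = 0
r, p_value = st.pearsonr(beds, price)
print(f"r = {r:.2f}, p = {p_value:.3g}")
```

With several hundred thousand rows, even very weak correlations will come out as statistically significant, so the size of `r` matters more than the p-value when deciding which variables are worth following up.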