# Planning Methods: Part II, Spring 2023

# Lab 1: Stats and Python Refresher

**About This Lab**
* We will be running through this notebook together. If you have a clarifying question or other question of broad interest, feel free to interrupt and ask it! 
* We recognize that there are many modes of learning. Please go with what works best for you: printing out the Jupyter notebook, duplicating it so that you can refer back to the original, or working directly in it. There isn't a single right way.
* This lab requires that you download the following file and place it in a `Data` folder inside the directory that holds this Jupyter notebook (the code below reads `Data/property_data.csv`):
    * `property_data.csv`

## Objectives
By the end of this lab, you will have reviewed how to:
1. Read and write files
2. Check for and drop nulls
3. Create subdataframes
4. Produce descriptive statistics
5. Conduct statistical tests

## 1 Import packages

In [None]:
import pandas as pd
import numpy as np

pd.options.display.float_format = '{:.2f}'.format 
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)

%matplotlib inline
import matplotlib.pylab as pylab
import matplotlib.pyplot as plt

from scipy import stats
from scipy.stats import t, chisquare, iqr
from scipy.stats import ttest_ind

import warnings 
warnings.filterwarnings('ignore')

In [None]:
# the %pip magic installs into the environment this notebook's kernel is using
%pip install researchpy

In [None]:
import researchpy as rp

## 2 Read data

In [None]:
raw = pd.read_csv('Data/property_data.csv')
raw.head()

In [None]:
# check dimension
raw.shape

In [None]:
# check length
len(raw)

In [None]:
raw.dtypes

### 2.1 Check and drop nulls

In [None]:
# check for null values
raw.isnull().sum()

In [None]:
# drop rows with NaN values; reset_index(drop=True) renumbers the rows from 0
# and discards the old index instead of keeping it as a new column
df = raw.dropna().reset_index(drop=True)
df.head()
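
To see what the null check and `reset_index(drop=True)` do in isolation, here is a minimal sketch on a made-up four-row dataframe. The values are hypothetical and are not taken from `property_data.csv`; only the column names are borrowed from the lab.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, NOT the lab's property_data.csv
toy = pd.DataFrame({
    "price_000": [250.0, np.nan, 410.0, 180.0],
    "num_room": [3, 2, np.nan, 4],
})

# count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# dropna() removes any row with a missing value but keeps the old row labels
dropped = toy.dropna()
print(dropped.index.tolist())

# reset_index(drop=True) renumbers from 0 and discards the old labels
clean = dropped.reset_index(drop=True)
print(clean.index.tolist())
```

Without the reset, later code that assumes the row labels run consecutively from 0 can silently pick the wrong rows.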

### 2.2 Check for outliers
Plus a sneak preview of plots!

<img src = 'Data/boxplots.jpg' width = 500>
Source: https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

In [None]:
# visualize population density
x = df['pop_dens']
plt.boxplot(x)
plt.show()
plt.hist(x, bins=250)
plt.show()

In [None]:
# visualize price
x = df['price_000']
plt.boxplot(x)
plt.show()
plt.hist(x, bins=250)
plt.show()

In [None]:
var = df['price_000']
q_75 = np.quantile(var, 0.75)
q_75

### Optional: how many observations would be dropped if we got rid of 'price_000' outliers?

In [None]:
# step 1: calculate interquartile range
var = df['price_000']

q_75 = np.quantile(var, 0.75)
q_25 = np.quantile(var, 0.25)
q_50 = np.quantile(var, 0.5) ### this is also the median

iqr_calc = q_75 - q_25 ### this should give the same output as the function scipy.stats.iqr()
print(iqr_calc)

In [None]:
# step 2: use the 1.5xIQR rule 
outliers = df[(var < (q_25 - 1.5 * iqr_calc))|(var > (q_75 + 1.5 * iqr_calc))]

print(len(outliers), len(df))

outliers
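
The two steps above can be wrapped into one reusable helper. This is a sketch under stated assumptions: the function name `iqr_outlier_mask` and the toy prices are hypothetical, and the comparison with `scipy.stats.iqr` is only a sanity check that the manual calculation matches.

```python
import numpy as np
import pandas as pd
from scipy.stats import iqr

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the k*IQR fences."""
    q_25, q_75 = np.quantile(values, [0.25, 0.75])
    spread = q_75 - q_25
    lower, upper = q_25 - k * spread, q_75 + k * spread
    return (values < lower) | (values > upper)

# hypothetical toy prices with one obviously extreme value
toy_price = pd.Series([100, 110, 120, 130, 140, 150, 900], name="price_000")

mask = iqr_outlier_mask(toy_price)
print(toy_price[mask])                      # the flagged observations
print(mask.sum(), "of", len(toy_price))     # how many would be dropped

# the manual IQR should agree with scipy's helper
manual_iqr = np.quantile(toy_price, 0.75) - np.quantile(toy_price, 0.25)
print(np.isclose(manual_iqr, iqr(toy_price)))
```

On the lab data, something like `df[~iqr_outlier_mask(df['price_000'])]` would keep only the non-outlier rows; whether dropping them is appropriate is a separate judgment call.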

### 2.3 Export data (for future labs)
Name it whatever you'd like and remember where you save it so you can access it next week.

In [None]:
df.to_csv('clean_property_data.csv')
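
One thing to keep in mind when you read the exported file back in: by default `to_csv` also writes the row index as an extra, unnamed first column. The sketch below shows the difference on a hypothetical two-row dataframe, writing to an in-memory buffer (`io.StringIO`) instead of a real file so that nothing on disk is touched.

```python
import io
import pandas as pd

# hypothetical toy data, NOT the lab's property_data.csv
toy = pd.DataFrame({"price_000": [250.0, 410.0], "num_room": [3, 2]})

# default: the index is written too and comes back as an "Unnamed: 0" column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(without_index.columns.tolist())
```

Either passing `index=False` when writing or `index_col=0` when reading avoids the stray column.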

### 2.4 Create sub-dataframe

In [None]:
sub_df = df[['house','apt','price_000','age_0_10','age_20_more','pcn_green','num_room']].copy() # select column by names
sub_df.head()

In [None]:
# rename variables of interest
sub_df.rename(columns={"price_000":"price", 
                   "age_0_10":"age_new", 
                   "age_20_more":"age_old", 
                   "num_room":"rooms"}, inplace = True)

In [None]:
# slicing using loc and iloc
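
As a starting point for this cell, here is a minimal sketch of the difference between `loc` (label-based) and `iloc` (position-based). The dataframe is a hypothetical stand-in for `sub_df`, with made-up values.

```python
import pandas as pd

# hypothetical stand-in for sub_df
toy = pd.DataFrame({
    "price": [250.0, 410.0, 180.0, 320.0],
    "rooms": [3, 2, 4, 3],
    "apt": [0, 1, 1, 0],
})

# loc is label-based: the row slice 0:2 INCLUDES the row labelled 2
by_label = toy.loc[0:2, ["price", "rooms"]]

# iloc is position-based: the row slice 0:2 EXCLUDES position 2
by_position = toy.iloc[0:2, 0:2]

# loc also accepts a boolean condition to pick rows
apartment_prices = toy.loc[toy["apt"] == 1, "price"]

print(by_label.shape, by_position.shape)
print(apartment_prices.tolist())
```

The inclusive versus exclusive end point is the most common source of off-by-one surprises when switching between the two.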

## 3 Describe variables

### 3.1 Continuous variable

In [None]:
# descriptive stats for property price
sub_df['price'].describe()

If we're only interested in certain statistics, we can also call them up specifically.

In [None]:
# print the mean, median and standard deviation of price
print(f"The price mean is {sub_df['price'].mean()}")
print(f"The price median is {sub_df['price'].median()}")
print(f"The price stdev is {sub_df['price'].std()}")

Next week we'll look more closely at how to use a histogram to visualize the distribution of a continuous variable.


### 3.2 Discrete numeric variable (dummy variable)

In [None]:
# descriptive stats for house dv
sub_df['house'].describe()

In [None]:
# we can also use the value_counts function (in general, it gives us a better sense of categorical variables)
sub_df['house'].value_counts()

In [None]:
# and we can normalize value_counts to get percentages
sub_df["house"].value_counts(normalize = True)
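
For a 0/1 dummy variable, the normalized count of 1s is the same number as the mean that `describe()` reports. A small sketch with a hypothetical `house` column shows this:

```python
import pandas as pd

# hypothetical dummy variable: 1 = house, 0 = not a house
house = pd.Series([1, 0, 0, 1, 1, 0, 1, 1], name="house")

# share of each category
shares = house.value_counts(normalize=True)
print(shares)

# the mean of a 0/1 variable is the share of 1s
share_of_ones = shares.loc[1]
print(share_of_ones, house.mean())
```

This is why the mean of a dummy can be read directly as a proportion.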

#### 3.2.1 Stats for all variables

In [None]:
# these functions have been helpful for individual variables, but what if you want to see summary stats for ALL the
# variables in your dataframe at once?
sub_df.describe().T

In [None]:
# try deleting the .T to see what happens if you don't use it - either way is fine!

## 4 Statistical tests

In [None]:
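# define a universal set of statistics to be called with the ".agg" command
# (note: this name shadows the `stats` module imported from scipy in section 1;
# the t-tests below still work because `ttest_ind` was imported directly)
stats = ['count','min','max','mean', 'median', 'std']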

### 4.1 T-test (of means)

#### 4.1.1 Do apartments have different prices from houses?

In [None]:
# descriptive price stats for apartment dv 
# groupby and aggregate functions are helpful for looking at crosstabulated summary statistics
sub_df["price"].groupby(sub_df["apt"]).agg(stats)
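
To make the groupby-then-aggregate pattern concrete, here is a sketch on a hypothetical five-row dataframe. The list name `summary_stats` is an assumption chosen for this example so that it does not clash with anything defined in the notebook.

```python
import pandas as pd

# hypothetical toy data: two apartments and three non-apartments
toy = pd.DataFrame({
    "apt": [1, 1, 0, 0, 0],
    "price": [200.0, 300.0, 400.0, 500.0, 600.0],
})

# each string names a built-in pandas aggregation
summary_stats = ["count", "mean", "median"]

# one row per group (apt = 0, apt = 1), one column per statistic
by_apt = toy.groupby("apt")["price"].agg(summary_stats)
print(by_apt)
```

Selecting the column after grouping, as here, gives the same result as grouping the price series by the `apt` series.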

In [None]:
# create apt and non-apt price variables
apt_p = sub_df[sub_df.apt == 1].price #apartment price
n_apt_p = sub_df[sub_df.apt == 0].price #non-apartment price

In [None]:
# run t-test using ttest_ind function from scipy.stats package
ttest_ind(apt_p, n_apt_p, equal_var = False, nan_policy = "omit")

Read documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
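
Setting `equal_var = False` asks `ttest_ind` for Welch's t-test, which does not assume the two groups have the same variance. As a rough sketch of what happens under the hood, the block below computes the Welch statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand and compares them with scipy's output. The two price arrays are hypothetical, and the alias `sps` is an assumption used here so the module does not clash with the notebook's `stats` list.

```python
import numpy as np
from scipy import stats as sps  # aliased to avoid clashing with the notebook's `stats` list

# hypothetical prices for two groups (NOT the lab data)
a = np.array([210.0, 250.0, 190.0, 230.0, 260.0])
b = np.array([310.0, 280.0, 350.0, 330.0, 290.0, 400.0])

# squared standard error of each group mean (sample variance / n)
se2_a = a.var(ddof=1) / len(a)
se2_b = b.var(ddof=1) / len(b)

# Welch t statistic: difference in means over the combined standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1))

# two-sided p-value from the t distribution
p_manual = 2 * sps.t.sf(abs(t_manual), dof)

result = sps.ttest_ind(a, b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The sign of the statistic depends on the order of the two arguments; the p-value does not.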

In [None]:
# if you wanted to normalize the price of a property by the number of rooms, how would you change the code?

# create per room price variable in dataframe
sub_df['pp_rm'] = sub_df['price']/sub_df['rooms']

# create variables for t-test
apt_rm_p = sub_df[sub_df.apt == 1].pp_rm #Apartment Price per Room
n_apt_rm_p = sub_df[sub_df.apt == 0].pp_rm #Non-Apartment Price per Room

# run t-test
ttest_ind(apt_rm_p, n_apt_rm_p, equal_var = False, nan_policy="omit")

#### 4.1.2 Is the price of newer apartments different from older apartments?

In [None]:
# what descriptive stats are relevant here? 
# create subdataframe, group price of apartments by new vs. old
apt_p = (sub_df[sub_df.apt == 1].price) #price only of apartments

apt_p.groupby(sub_df["age_new"]).agg(stats)

In [None]:
# create old and new apartment price variables
o_apt_p = sub_df[(sub_df.age_new == 0) & (sub_df.apt == 1)].price #price of old apartments
y_apt_p = sub_df[(sub_df.age_new == 1) & (sub_df.apt == 1)].price #price of young apartments

In [None]:
# run t-test
ttest_ind(o_apt_p, y_apt_p, equal_var = False, nan_policy="omit")

### 4.2 Chi-square test (of proportions)

#### 4.2.1 Are houses more likely to be older (age_20_more, renamed age_old above) or younger?

In [None]:
# descriptive stats (crosstab)
pd.crosstab(sub_df['house'], sub_df['age_old'], margins = True, margins_name = 'Total')

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html

In [None]:
# normalize by row ('index') - could also normalize by 'columns'
pd.crosstab(sub_df['house'], sub_df['age_old'], normalize = 'index', margins = True, margins_name = 'Total')

In [None]:
# run chi-square test
table, results = rp.crosstab(sub_df["house"], sub_df["age_old"], prop = "row", test = "chi-square")

In [None]:
# view table
table

In [None]:
# view results
results
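
If you want to cross-check a chi-square result without researchpy, `scipy.stats.chi2_contingency` runs the same kind of test directly on a pandas crosstab. The sketch below uses a hypothetical ten-row dataframe. It passes `correction=False` to get the plain Pearson chi-square; whether that matches researchpy's default for 2x2 tables is not verified here, so treat small differences between the two outputs as a prompt to check that setting.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical toy data, NOT the lab's property data
toy = pd.DataFrame({
    "house":   [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "age_old": [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
})

# observed counts (no margins: the test needs only the inner cells)
observed = pd.crosstab(toy["house"], toy["age_old"])
print(observed)

# Pearson chi-square without Yates' continuity correction
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
print(expected)  # counts we would expect if house and age_old were independent
```

With cells this small the chi-square approximation is unreliable; the example is only meant to show the mechanics.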