# Planning

Business stakeholders and end users often ask either very general questions that are hard to answer directly or extremely specific questions that will not achieve their underlying goal. This leads to miscommunication, time spent on work that is ultimately thrown away, or an inadequate understanding of the problem being investigated. As you gain experience with the data and the domain, you understand the problems better and can ask more informative, specialized questions. Even then, it is important to work through this planning stage, because it is all too easy to go down a rabbit hole on a data science project.

**The goal** of this stage is to clearly define your goal(s), your measures of success, and a plan for achieving them.

**The deliverable** is documentation of your goal, your measure of success, and how you plan on getting there. If you haven't clearly defined success, you will not know when you have achieved it.

**How to get there:** Answer questions about the final product, and formulate or identify any initial hypotheses (your own or other people's).

Common questions include:

- What will the end product look like?
- What format will it be in?
- Who will it be delivered to?
- How will it be used?
- How will I know I'm done?
- What is my MVP?
- How will I know it's good enough?
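
One way to make "How will I know it's good enough?" concrete for a regression project is to fix a baseline before modeling and require the model to beat it. The sketch below is only an illustration under assumptions: the numbers are made up, the baseline is taken to be the mean of the target, and RMSE is assumed as the measure of success.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical home values and hypothetical model predictions (not project data)
actual = np.array([200_000, 250_000, 300_000, 350_000, 400_000])
model_pred = np.array([210_000, 240_000, 310_000, 340_000, 410_000])

# Baseline: always predict the mean of the observed target
baseline_pred = np.full(len(actual), actual.mean())

baseline_rmse = rmse(actual, baseline_pred)
model_rmse = rmse(actual, model_pred)

# Success criterion decided during planning: the model must beat the baseline
beats_baseline = model_rmse < baseline_rmse
print(f"baseline RMSE: {baseline_rmse:,.0f}")
print(f"model RMSE:    {model_rmse:,.0f}")
print(f"beats baseline: {beats_baseline}")
```

Writing the criterion down as a comparison like this gives the project a clear stopping point, whatever metric is finally chosen.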

Formulating hypotheses:
- Is attribute V1 related to attribute V2?
- Is the mean of target variable Y for subset A significantly different from that of subset B?
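
The two hypothesis templates above map onto standard tests. As a minimal sketch, assuming synthetic data in place of real attributes and a significance level of 0.05 (the `alpha` used in this notebook), a Pearson correlation test covers the first question and a two-sample t-test covers the second. The variable names `v1`, `v2`, `y_a`, and `y_b` are placeholders.

```python
import numpy as np
from scipy import stats

alpha = 0.05  # significance level, matching the notebook's alpha
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic data (assumption): v2 depends linearly on v1 plus noise
v1 = rng.normal(size=200)
v2 = 2 * v1 + rng.normal(scale=0.5, size=200)

# H0: there is no linear correlation between V1 and V2
r, p_corr = stats.pearsonr(v1, v2)
print(f"pearson r = {r:.3f}, reject H0: {p_corr < alpha}")

# Synthetic target values for two subsets with different true means
y_a = rng.normal(loc=0.0, scale=1.0, size=200)
y_b = rng.normal(loc=1.0, scale=1.0, size=200)

# H0: the mean of Y is the same in subset A and subset B
# equal_var=False selects Welch's t-test, which does not assume equal variances
t_stat, p_ttest = stats.ttest_ind(y_a, y_b, equal_var=False)
print(f"t = {t_stat:.3f}, reject H0: {p_ttest < alpha}")
```

Stating the null hypothesis next to each test keeps the planning document and the later analysis aligned.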


In [52]:

#libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#ignore warnings
import warnings
warnings.filterwarnings('ignore')
#pipeline imports
import env

#stats imports
from scipy import stats
from scipy.stats import pearsonr, spearmanr

#sklearn imports
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler, QuantileTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

#math import
from math import sqrt

#project library
from wrangle import get_zillow_data
# remove_outliers is not defined in wrangle.py, so it is defined in a cell below instead of imported
# from prepare import clean_data, get_hist, get_box, remove_outliers, train_validate_test_split
# from explore import explore_univariate, exp_bivariate_categorical, exp_bivariate_continuous, exp_multivariate, exp_bivariate_categorical
# from evaluate import scale_it, rfe, select_kbest, get_baseline, get_residuals, plot_residual, regression_errors, baseline_mean_errors, better_than_baseline
# from model import model_baseline, linear_regression, tweedieregressor, lassolars, polynomialregression, model_test, plot_test_residuals, plot_distributions


#Hypothesis alpha
alpha = .05

In [43]:
df = get_zillow_data()

In [44]:
df.head()

   bedrooms  bathrooms  square_feet    year    taxes  home_value    fips  zip_code  transaction_date
0       2.0        1.0        775.0  1952.0  1056.05     74118.0  6037.0   96338.0        2017-09-18
1       4.0        2.0       1497.0  1949.0   6793.4    557360.0  6037.0   96420.0        2017-09-18
2       3.0        2.0       1337.0  1956.0  5314.35    459600.0  6037.0   96161.0        2017-09-18
3       4.0        2.0       1631.0  2002.0  3294.29    262657.0  6037.0   96361.0        2017-09-18
4       3.0        3.0       1454.0  1989.0   2410.4    130027.0  6037.0   97318.0        2017-09-18


In [45]:
df.shape

(52441, 9)

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52441 entries, 0 to 52440
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   bedrooms          52441 non-null  float64
 1   bathrooms         52441 non-null  float64
 2   square_feet       52359 non-null  float64
 3   year              52325 non-null  float64
 4   taxes             52437 non-null  float64
 5   home_value        52440 non-null  float64
 6   fips              52441 non-null  float64
 7   zip_code          52415 non-null  float64
 8   transaction_date  52441 non-null  object 
dtypes: float64(8), object(1)
memory usage: 3.6+ MB


In [34]:
# convert transaction_date to a datetime dtype
df['transaction_date'] = pd.to_datetime(df['transaction_date'])

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52441 entries, 0 to 52440
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   bedrooms          52441 non-null  float64       
 1   bathrooms         52441 non-null  float64       
 2   square_feet       52359 non-null  float64       
 3   year              52325 non-null  float64       
 4   taxes             52437 non-null  float64       
 5   home_value        52440 non-null  float64       
 6   fips              52441 non-null  float64       
 7   zip_code          52415 non-null  float64       
 8   transaction_date  52441 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(8)
memory usage: 3.6 MB
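
The conversion above relies on pandas inferring the date layout. As a hedged sketch on made-up strings (not the Zillow column), an explicit `format` documents the expected layout and `errors="coerce"` turns unparseable values into `NaT` so they can be counted rather than raising.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
raw = pd.Series(["2017-09-18", "2017-01-02", "not a date"])

# An explicit format documents the expected layout;
# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)         # a datetime64 dtype
print(parsed.isna().sum())  # number of values that failed to parse
```

Whether coercing is appropriate depends on the data; for a column that already parses cleanly, the plain `pd.to_datetime` call is enough.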


In [36]:
# There are some null values
df.isnull().sum()

bedrooms              0
bathrooms             0
square_feet          82
year                116
taxes                 4
home_value            1
fips                  0
zip_code             26
transaction_date      0
dtype: int64

In [37]:
# List the column names
df.columns.values

array(['bedrooms', 'bathrooms', 'square_feet', 'year', 'taxes',
       'home_value', 'fips', 'zip_code', 'transaction_date'], dtype=object)

In [38]:

# Show the 10 least frequent values in each column (NaN is counted as a value)
cols = ['bedrooms', 'bathrooms', 'square_feet', 'year', 'taxes',
       'home_value', 'fips', 'zip_code', 'transaction_date']
for col in cols:
    print(col)
    print(df[col].value_counts(dropna=False, ascending=True).head(10))

bedrooms
14.0      1
11.0      1
10.0      2
12.0      3
9.0       8
8.0      24
7.0     106
0.0     137
1.0     612
6.0     635
Name: bedrooms, dtype: int64
bathrooms
13.0     1
18.0     1
11.0     3
8.5      3
10.0     5
9.0     13
7.5     16
6.5     47
8.0     53
7.0     88
Name: bathrooms, dtype: int64
square_feet
4566.0     1
3103.0     1
11704.0    1
3861.0     1
4967.0     1
4780.0     1
4357.0     1
4203.0     1
7770.0     1
4536.0     1
Name: square_feet, dtype: int64
year
1889.0    1
1880.0    1
1882.0    1
1897.0    1
1878.0    1
1892.0    1
1894.0    1
1893.0    3
2016.0    3
1887.0    3
Name: year, dtype: int64
taxes
1136.39     1
2818.02     1
3857.22     1
4871.52     1
15512.71    1
5838.96     1
9969.72     1
1602.96     1
19379.20    1
872.36      1
Name: taxes, dtype: int64
home_value
NaN          1
416775.0     1
358482.0     1
114633.0     1
1555151.0    1
1614974.0    1
382322.0     1
762655.0     1
296960.0     1
49636.0      1
Name: home_value, dtype: int64
fips

In [39]:
# dropping nulls
df = df.dropna()
df.isnull().sum()

bedrooms            0
bathrooms           0
square_feet         0
year                0
taxes               0
home_value          0
fips                0
zip_code            0
transaction_date    0
dtype: int64
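
Before committing to a blanket `dropna()`, it can help to check how much data it costs. In the null counts shown earlier, at most 229 rows (the sum of the per-column counts) out of 52,441 contain a null, which is under half a percent. The sketch below shows the same check on a small made-up frame; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a few missing values (not the Zillow data)
toy = pd.DataFrame({
    "square_feet": [775.0, 1497.0, np.nan, 1631.0, 1454.0],
    "year": [1952.0, np.nan, 1956.0, 2002.0, 1989.0],
    "home_value": [74118.0, 557360.0, 459600.0, 262657.0, 130027.0],
})

# Share of missing values per column, as a percentage
pct_missing = toy.isnull().mean() * 100
print(pct_missing)

# Drop every row that has at least one missing value and report the loss
cleaned = toy.dropna()
rows_lost = len(toy) - len(cleaned)
print(f"dropped {rows_lost} of {len(toy)} rows ({rows_lost / len(toy):.0%})")
```

If the loss were large, imputing or dropping individual columns might be a better option than dropping rows.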

In [49]:
def remove_outliers(df, k, col_list):
    '''
    Remove outliers from each column in col_list and return the filtered dataframe.

    A row is kept only if its value lies strictly between q1 - k * IQR and
    q3 + k * IQR for every listed column. The columns must be numeric.
    '''
    for col in col_list:

        q1, q3 = df[col].quantile([.25, .75])  # get quartiles

        iqr = q3 - q1   # calculate interquartile range

        upper_bound = q3 + k * iqr   # get upper bound
        lower_bound = q1 - k * iqr   # get lower bound

        # keep only the rows strictly inside the bounds for this column
        df = df[(df[col] > lower_bound) & (df[col] < upper_bound)]

    return df
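
To see what the IQR rule does, here is a small worked example on made-up values. The function is repeated in compact form so the sketch runs on its own. The second frame illustrates a caveat of the strict comparisons: when the first and third quartiles are equal, the IQR is 0 and no row survives.

```python
import pandas as pd

def remove_outliers(df, k, col_list):
    """Keep rows strictly inside the IQR fences for each listed column."""
    for col in col_list:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower_bound = q1 - k * iqr
        upper_bound = q3 + k * iqr
        df = df[(df[col] > lower_bound) & (df[col] < upper_bound)]
    return df

# Made-up values: 100 is far from the rest of the column
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 100]})

# q1 = 2.25, q3 = 4.75, iqr = 2.5, so with k = 1.5 the fences are -1.5 and 8.5
trimmed = remove_outliers(toy, 1.5, ["x"])
print(trimmed["x"].tolist())  # [1, 2, 3, 4, 5]

# Caveat: if q1 == q3 the IQR is 0 and the strict comparisons keep no rows
flat = pd.DataFrame({"x": [1, 1, 1, 1, 1, 2]})
print(len(remove_outliers(flat, 1.5, ["x"])))  # 0
```

Because the columns are filtered one after another, the quartiles of later columns are computed on rows that survived the earlier ones, so the order of `col_list` can affect the result.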

In [50]:
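# Use function to remove outliers from the continuous numeric columns only.
# The quartile arithmetic fails on non-numeric columns such as string dates.
# year, fips, and zip_code are left out here as a judgment call (adjust this list as needed).
outlier_cols = ['bedrooms', 'bathrooms', 'square_feet', 'taxes', 'home_value']
df_copy = remove_outliers(df, 1.5, outlier_cols)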
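
The IQR filter does arithmetic on quantiles, so it only applies to numeric columns; a string column such as an unparsed `transaction_date` cannot be used. A minimal sketch of picking candidate columns by dtype follows, using a made-up frame. The `id_like` list is an assumption for illustration, not something taken from the project code.

```python
import pandas as pd

# Made-up frame mixing numeric, identifier-like, and string columns
toy = pd.DataFrame({
    "square_feet": [775.0, 1497.0, 1337.0],
    "fips": [6037.0, 6037.0, 6037.0],
    "transaction_date": ["2017-09-18", "2017-09-18", "2017-09-18"],
})

# select_dtypes keeps only columns with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['square_feet', 'fips']

# A numeric dtype does not make a column continuous: codes such as fips
# are still categories, so they are excluded by hand here (an assumption)
id_like = ["fips"]
outlier_cols = [c for c in numeric_cols if c not in id_like]
print(outlier_cols)  # ['square_feet']
```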