# Mozilla 2019 Outreachy Data Science Project

## Part 1: Initial Contribution

In [None]:
#importing packages needed for exploratory data analysis and data visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

### I'll start by using Pandas to convert the CSV file into a Dataframe.

In [None]:
#load in dataset
df = pd.read_csv('dataset.csv')
df.head()

### Thanks to pandas' built-in .head() function, we can preview the first few rows of our dataset. However, this is just a sneak peek.

I want to know more about each column and their datatypes. With this information, I can decide which data analysis methods are best suited for this dataset and whether some data cleaning is necessary.

In [None]:
#check column names
col_names = df.columns
print(list(col_names))

#get datatypes for each column
df.info()

**This dataset has a mix of datatypes including floats, integers, and objects (which represent strings).**

_sidenote:_
From the information above, we can see that certain features have very little data to offer. For example, only seven properties have a value for pool quality, so we can probably assume the PoolQC feature won't be the best choice for describing the sale price. Three other features with few values are MiscFeature, Alley, and Fence.
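One way to make "very little data" concrete is to compute the share of missing rows per column. This is a minimal sketch on a toy dataframe; the values and the 50% cutoff are assumptions for illustration, not something taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
})

# Share of rows that are missing in each column (0.0 to 1.0).
missing_share = toy.isna().mean()

# Hypothetical cutoff: flag columns where more than half of the rows are missing.
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
print(sparse_cols)
```

On the real dataframe, the same two lines would run on `df` instead of `toy`.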

## Data Cleaning and Manipulation

#### I've decided to drop the features with too few values since they won't be the best source of information.

I'm using pandas .drop() function to access these specific columns and drop them while still keeping the same number of rows.

In [None]:
df = df.drop(['PoolQC', 'Alley', 'Fence', 'MiscFeature'], axis=1)

#### I am using some basic built-in Pandas functions to assess the cleanliness of the dataset.

An easy thing to check for is whether or not the dataframe has any duplicate rows due to some error when the data was being compiled. Using pandas .duplicated() function, I created a temporary dataframe where all duplicates would be stored. It returned an empty dataframe, meaning there are no duplicate rows in the dataset.

In [None]:
#looking for duplicate rows (there are none)
duplicatesdf = df[df.duplicated()]
print(duplicatesdf)
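To see what .duplicated() reports when a duplicate does exist, here is a small sketch on a made-up dataframe (the column names `a` and `b` are placeholders). It also shows .drop_duplicates(), which would be the next step if the check above had found anything.

```python
import pandas as pd

# Made-up frame in which the third row repeats the second.
toy = pd.DataFrame({"a": [1, 2, 2], "b": ["x", "y", "y"]})

# True for every row that repeats an earlier row.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# Keep only the first occurrence of each row.
deduped = toy.drop_duplicates()
print(n_dups, len(deduped))
```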

Another thing to look out for when exploring a dataset is NaN, or missing, values. You want to drop or fill NaN values so they don't cause errors or skew the data when you begin your analysis. Another handy built-in pandas function is .isna(); chained with .any(), it returns True or False for each column depending on whether that column contains any null values.

In [None]:
#checking if any columns contain nan values
df.isna().any()

I wanted to know more about the columns with NaN values so I used .sum() to get the total number for each column. Next, I'm going to consult the data_description file to get an understanding of those particular columns.

In [None]:
#examine nan values closer
print(df.isnull().sum())
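With many columns, the full list of counts is long. A possible refinement, sketched here on a toy dataframe with invented values, is to keep only the columns that actually have missing values and sort them by count.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only two columns contain NaN.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "SBrkr"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Count NaN values per column, drop the zero counts, and sort descending.
na_counts = toy.isna().sum()
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
print(na_counts)
```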

#### Breakdown of columns with NaN values

- LotFrontage = linear feet of street connected to property. (These NaN values could be due to properties with no lot frontage as opposed to errors or missing information)
- MasVnrType, MasVnrArea = masonry veneer type and its area in square feet. (These NaN values represent properties that don't have masonry veneer; they are not due to an error)
- BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2 = These columns describe various aspects of the basement in these units. (According to the data description, NA symbolizes the homes without basements. Therefore we know these aren't due to errors)
- Electrical = type of electrical system in property. (There is only one NaN value, probably due to a small error or missing information)
- FireplaceQu = quality of fireplace (NaN values can be attributed to homes without a fireplace. This is not due to errors)
- GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond = descriptive information about properties with garages (NaN values represent homes with no garages)

**I've concluded that all the NaN values, excluding the Electrical column, are mostly valid and due to some properties having certain features that others don't.**

Since the rest of the values in the LotFrontage column are floats, I am using Pandas .fillna() function to put a 0 in place of NaN to represent properties with no Lot Frontage.

In [None]:
df['LotFrontage'] = df['LotFrontage'].fillna(0.0)

In [None]:
#checking for null values to confirm that it worked 
df['LotFrontage'].isnull().sum()

I am going to handle the masonry veneer type and the masonry veneer area features separately since one is a string object and the other is a float. I'm going to replace the type with NoMV (no masonry veneer) and the area with the value 0.0.

In [None]:
df['MasVnrType'] = df['MasVnrType'].fillna('NoMV')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0.0)

In [None]:
#checking for null values
df[['MasVnrType', 'MasVnrArea']].isnull().sum()

The values for the basement properties are all strings. I've decided to replace NA with NB (no basement) which creates a new class within that category to replace the missing values.

In [None]:
df[['BsmtQual', 'BsmtCond', 'BsmtExposure']] = df[['BsmtQual', 'BsmtCond', 'BsmtExposure']].fillna('NB')

In [None]:
df[['BsmtFinType1', 'BsmtFinType2']] = df[['BsmtFinType1', 'BsmtFinType2']].fillna('NB')

In [None]:
#checking for null values 
df[['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']].isnull().sum()
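One detail worth checking at this step: .fillna() returns a new object by default, so the result has to be assigned back for the dataframe to change. A minimal sketch on a toy column (the values are invented):

```python
import numpy as np
import pandas as pd

# Toy column with one missing basement-quality value.
toy = pd.DataFrame({"BsmtQual": ["Gd", np.nan, "TA"]})

# The result is discarded here, so toy itself is unchanged.
toy[["BsmtQual"]].fillna("NB")
still_missing = int(toy["BsmtQual"].isna().sum())

# Assigning the result back is what actually updates the column.
toy[["BsmtQual"]] = toy[["BsmtQual"]].fillna("NB")
now_missing = int(toy["BsmtQual"].isna().sum())
print(still_missing, now_missing)
```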

I believe the one electrical NaN value will be pretty inconsequential to the whole dataset. I'm going to change the value to NotSpecif so it won't throw an error later.

In [None]:
df['Electrical'] = df['Electrical'].fillna('NotSpecif')

In [None]:
#checking for nan values to see if it worked
df['Electrical'].isnull().sum()

Fireplace Quality is another string object. I am going to replace NA with NF (no fireplace) so that there isn't a null value for properties without this feature.

In [None]:
df['FireplaceQu'] = df['FireplaceQu'].fillna('NF')

In [None]:
#confirming that it worked
df['FireplaceQu'].isnull().sum()

Lastly, I am going to replace the null values in the Garage category with NG to represent properties without garages.

In [None]:
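#assign the result back, since fillna() returns a new dataframe (GarageYrBlt becomes an object column once 'NG' is added)
df[['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']] = df[['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']].fillna('NG')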

In [None]:
df[['GarageCars', 'GarageArea']] = df[['GarageCars', 'GarageArea']].fillna(0.0)

In [None]:
df[['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond']].isnull().sum()
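The column-by-column fills above could also be expressed as a single call, since .fillna() accepts a dictionary that maps column names to fill values. This is only a sketch on a toy dataframe; the placeholder labels are the ones used in this notebook, while the data values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and two string columns (invented values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan],
    "MasVnrType": ["BrkFace", np.nan],
    "FireplaceQu": [np.nan, "TA"],
})

# One fill value per column: 0.0 for the float column, a label for the string columns.
fill_values = {"LotFrontage": 0.0, "MasVnrType": "NoMV", "FireplaceQu": "NF"}
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in one dictionary makes it easier to review every replacement choice in one place.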

#### I've replaced all the null values in the dataframe as part of my data cleaning process. After I confirm it worked by checking the entire dataframe for null values, I can move on to the next part.

In [None]:
df.isnull().values.any()

The last column I am going to drop is the ID column. The dataframe has an index and this feature isn't necessary for answering the questions about the sales price.

In [None]:
del df['Id']

## Data Exploration

I am using pandas' built-in .describe() function with include="all" to get summary statistics for every column in the dataframe, not just the numerical ones. There is some interesting information here: how many unique values each column has, the most frequent (top) value, the min and max for each column, and the standard deviation, to name a few. There are NaN values where a statistic doesn't apply to that column's datatype.

In [None]:
df.describe(include="all").T
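If the NaN-filled cells in the combined table get in the way, the two kinds of summary can be requested separately. A small sketch on a toy dataframe with invented values:

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one string column (invented values).
toy = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# Non-numeric columns only (count, unique, top, freq).
categorical_summary = toy.describe(exclude=[np.number])
print(numeric_summary)
print(categorical_summary)
```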

### I am going to create a dataframe specifically for numerical datatypes so we can explore the numerical data distribution

In [None]:
num_df = df.select_dtypes(include = ['float64', 'int64'])
num_df.head(10)
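Listing 'float64' and 'int64' works for this dataset. As a hedged alternative, select_dtypes also accepts the generic 'number' selector, which would pick up other numeric widths too. A toy example with placeholder column names:

```python
import pandas as pd

# Placeholder columns: an integer, a float, and a string column.
toy = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# 'number' matches every numeric dtype, whatever its bit width.
num_toy = toy.select_dtypes(include="number")
print(list(num_toy.columns))
```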

_sidenote:_ Now that I've separated the numerical data, I am noticing features that I'd like to take a closer look at later, such as Overall Quality and Overall Condition.

### Frequency Distribution of Sale Price

In [None]:
sns.set(style="whitegrid")
sns.histplot(df['SalePrice'], color='#F85888', bins=100)
plt.title('Sale Price', fontsize=16)
plt.xlabel('US Dollars', fontsize=14)

### Density Plot of Sale Price

In [None]:
sns.set(style="whitegrid")
sns.kdeplot(df['SalePrice'], color='#fec508', fill=True)
plt.title('Sale Price', fontsize=16)
plt.xlabel('US Dollars', fontsize=14)
plt.ylabel('Density', fontsize=14)



### Outlier Analysis

In [None]:
sns.set(style="whitegrid")
ax = sns.boxplot(x=df['SalePrice'])

#### Z-Score

In [None]:
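#shows which rows are outliers using the z score

from scipy import stats

def get_z_score(data, threshold=3):
    #absolute z-score of every value in the column
    z = np.abs(stats.zscore(data))
    #positions of the rows whose z-score exceeds the threshold
    return np.where(z > threshold)[0]

z_score = get_z_score(df['SalePrice'])
z_score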
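For reference, the z-score of a value is its distance from the mean measured in standard deviations. The sketch below computes it by hand with NumPy on a made-up array, using the population standard deviation, which is also the default in scipy.stats.zscore.

```python
import numpy as np

# Made-up values: eleven ordinary readings and one extreme value at the end.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0,
                   10.0, 12.0, 13.0, 11.0, 12.0, 50.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (values - values.mean()) / values.std()

# Positions whose absolute z-score exceeds the threshold of 3.
outlier_idx = np.where(np.abs(z) > 3)[0]
print(outlier_idx)
```

Only the last position is flagged here, because 50 lies more than three standard deviations above the mean.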

#### Interquartile Range (IQR)

In [None]:
Q1 = df['SalePrice'].quantile(0.25)
Q3 = df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)
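The IQR on its own doesn't flag anything yet. A common convention, assumed here rather than taken from this notebook, is to treat values more than 1.5 times the IQR below Q1 or above Q3 as outliers. A minimal sketch on a made-up price series:

```python
import pandas as pd

# Made-up prices with one obviously extreme value.
prices = pd.Series([100, 120, 130, 140, 150, 160, 170, 180, 500])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences (an assumed rule of thumb).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are treated as outliers.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers.tolist())
```

The same steps would apply to `df['SalePrice']` using the Q1, Q3, and IQR values computed above.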

## Resources

- [Using pandas describe method dataframe summary](https://backtobazics.com/python/pandas-describe-method-dataframe-summary/)
- [How to make histogram in python with pandas and seaborn](https://cmdlinetips.com/2019/02/how-to-make-histogram-in-python-with-pandas-and-seaborn/)
- [Ways to Detect and Remove Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)