## Data Wrangling

In [47]:
# importing libraries
import pandas as pd
import numpy as np

In [48]:
# load dataframe
df = pd.read_csv(r"Raw_data.csv", encoding = 'unicode_escape', low_memory = False)

#### Q.1) Print the first five rows of the dataframe and use a suitable function to extract the column names and dimension of the dataframe.

In [49]:
df.head(5)

|   | Permit Number | Permit Type | Permit Type Definition | Block | Lot | Street Number | Street Number Suffix | Street Name | Street Suffix | Unit | ... | Existing Units | Proposed Units | Plansets | TIDF Compliance | Existing Construction Type | Proposed Construction Type | Supervisor District | Zipcode | Location | Record ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202000000000.0 | 4 | sign - erect | 326 | 23 | 140 |  | Ellis | St |  | ... | 143.0 |  | 2.0 |  | 3.0 |  | 3.0 | 94102.0 | (37.785719256680785, -122.40852313194863) | 1380000000000.0 |
| 1 | 202000000000.0 | 4 | sign - erect | 306 | 7 | 440 |  | Geary | St | 0.0 | ... |  |  | 2.0 |  | 3.0 |  | 3.0 | 94102.0 | (37.78733980600732, -122.41063199757738) | 1420000000000.0 |
| 2 | 202000000000.0 | 3 | additions alterations or repairs | 595 | 203 | 1647 |  | Pacific | Av |  | ... | 39.0 | 39.0 | 2.0 |  | 1.0 | 1.0 | 3.0 | 94109.0 | (37.7946573324287, -122.42232562979227) | 1420000000000.0 |
| 3 | 202000000000.0 | 8 | otc alterations permit | 156 | 11 | 1230 |  | Pacific | Av | 0.0 | ... | 1.0 | 1.0 | 2.0 |  | 5.0 | 5.0 | 3.0 | 94109.0 | (37.79595867909168, -122.41557405519474) | 1440000000000.0 |
| 4 | 202000000000.0 | 6 | demolitions | 342 | 1 | 950 |  | Market | St |  | ... |  |  | 2.0 |  | 3.0 |  | 6.0 | 94102.0 | (37.78315261897309, -122.40950883997789) | 145000000000.0 |
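
Q.1 also asks for the column names and the dimension of the dataframe. A minimal sketch of how these could be extracted is shown here on a small stand-in frame; `sample` and its three columns are illustrative assumptions, not the permits data itself.

```python
import pandas as pd

# Hypothetical three-row stand-in for the permits dataframe
sample = pd.DataFrame({
    "Permit Type": [4, 4, 3],
    "Street Name": ["Ellis", "Geary", "Pacific"],
    "Zipcode": [94102.0, 94102.0, 94109.0],
})

column_names = sample.columns.tolist()  # column labels as a plain list
n_rows, n_cols = sample.shape           # shape is a (rows, columns) tuple

print(column_names)
print(n_rows, n_cols)
```

On the real data the same two attributes would be read from `df`, that is `df.columns` and `df.shape`.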


#### Q.2) Extract the datatype of each column

In [50]:
df.dtypes

Permit Number                     object
Permit Type                        int64
Permit Type Definition            object
Block                             object
Lot                               object
Street Number                      int64
Street Number Suffix              object
Street Name                       object
Street Suffix                     object
Unit                             float64
Unit Suffix                       object
Number of Existing Stories       float64
Number of Proposed Stories       float64
Voluntary Soft-Story Retrofit     object
Estimated Cost                   float64
Revised Cost                     float64
Existing Units                   float64
Proposed Units                   float64
Plansets                         float64
TIDF Compliance                   object
Existing Construction Type       float64
Proposed Construction Type       float64
Supervisor District              float64
Zipcode                          float64
Location        

#### Q.3) What is the count of the missing values in each column of the dataframe? Also find the overall total of missing values and convert it to a percentage.

In [51]:
# missing values in each column
df.isnull().sum()

Permit Number                         0
Permit Type                           0
Permit Type Definition                0
Block                                 0
Lot                                   0
Street Number                         0
Street Number Suffix             196684
Street Name                           0
Street Suffix                      2768
Unit                             169421
Unit Suffix                      196939
Number of Existing Stories        42784
Number of Proposed Stories        42868
Voluntary Soft-Story Retrofit    198865
Estimated Cost                    38066
Revised Cost                       6066
Existing Units                    51538
Proposed Units                    50911
Plansets                          37309
TIDF Compliance                  198898
Existing Construction Type        43366
Proposed Construction Type        43162
Supervisor District                1717
Zipcode                            1716
Location                           1700


In [52]:
# overall missing values
total_null = int(df.isnull().sum().sum())  # sum per column, then across columns
total_null

1324778

In [53]:
# percentage of missing values
total = df.size
per = total_null/total * 100
print("{:.2f}%".format(per))

25.62%
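
The percentage can also be read straight from the boolean missing-value mask, because the mean of a boolean array is the fraction of `True` entries. The sketch below illustrates this on a small hypothetical frame (`sample` is an assumption, not the permits data) and additionally reports the percentage per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with a few missing entries
sample = pd.DataFrame({
    "Unit": [np.nan, 0.0, np.nan, 0.0],
    "Street Suffix": ["St", "St", None, "Av"],
    "Permit Type": [4, 4, 3, 8],
})

missing_mask = sample.isna()

# Share of missing cells per column, in percent
pct_missing_by_column = missing_mask.mean() * 100

# Share of missing cells over the whole frame, in percent
pct_missing_overall = missing_mask.to_numpy().mean() * 100

print(pct_missing_by_column)
print("{:.2f}%".format(pct_missing_overall))
```

The per-column view is often more useful than the overall figure, since it shows which columns are nearly empty and which are nearly complete.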


#### Q.4) What is the dimension of the dataframe after removing the rows with missing values? Is the dimension the same after removing columns with missing values?

In [54]:
# dimension after removing rows with missing values
temp = df.dropna(axis = 0)
temp.shape

(0, 26)

In [55]:
# dimension after removing columns with missing values
temp2 = df.dropna(axis = 1)
temp2.shape

(198900, 8)

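No, the dimensions are not the same. Dropping rows with any missing value leaves (0, 26), because every row has at least one null, while dropping columns with any missing value leaves (198900, 8).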
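
To see why the two calls behave so differently, consider a small hypothetical frame (`sample` below is an assumption for illustration) in which one column is almost always null. Every row then contains at least one missing value, so dropping rows removes everything, while dropping columns only removes the affected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column, one partly missing, one fully missing
sample = pd.DataFrame({
    "Permit Type": [4, 3, 8],
    "Unit": [np.nan, 0.0, 0.0],
    "TIDF Compliance": [np.nan, np.nan, np.nan],
})

rows_kept = sample.dropna(axis=0)  # drop every row that has any NaN
cols_kept = sample.dropna(axis=1)  # drop every column that has any NaN

print(rows_kept.shape)
print(cols_kept.shape)
print(cols_kept.columns.tolist())
```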

#### Q.5) Drop the entirely empty columns and then impute the new dataframe with forward and backward fill. Also, extract the names of the columns that were dropped. Next, obtain a fresh new dataframe and impute the missing values with the mean of the column.

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198900 entries, 0 to 198899
Data columns (total 26 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Permit Number                  198900 non-null  object 
 1   Permit Type                    198900 non-null  int64  
 2   Permit Type Definition         198900 non-null  object 
 3   Block                          198900 non-null  object 
 4   Lot                            198900 non-null  object 
 5   Street Number                  198900 non-null  int64  
 6   Street Number Suffix           2216 non-null    object 
 7   Street Name                    198900 non-null  object 
 8   Street Suffix                  196132 non-null  object 
 9   Unit                           29479 non-null   float64
 10  Unit Suffix                    1961 non-null    object 
 11  Number of Existing Stories     156116 non-null  float64
 12  Number of Proposed Stories    

No column is entirely null: the highest missing count in Q.3 is 198898 (TIDF Compliance), which is still below the 198900 rows, so there is nothing to drop. Next, we impute the missing values in the numeric columns with the column mean.

In [None]:
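# impute each numeric column with its own mean; assign back instead of using
# inplace=True on a column selection, which is deprecated in recent pandas
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].mean())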

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198900 entries, 0 to 198899
Data columns (total 26 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Permit Number                  198900 non-null  object 
 1   Permit Type                    198900 non-null  int64  
 2   Permit Type Definition         198900 non-null  object 
 3   Block                          198900 non-null  object 
 4   Lot                            198900 non-null  object 
 5   Street Number                  198900 non-null  int64  
 6   Street Number Suffix           2216 non-null    object 
 7   Street Name                    198900 non-null  object 
 8   Street Suffix                  196132 non-null  object 
 9   Unit                           198900 non-null  float64
 10  Unit Suffix                    1961 non-null    object 
 11  Number of Existing Stories     198900 non-null  float64
 12  Number of Proposed Stories    
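
Q.5 also asks for the names of dropped columns, a forward/backward fill, and mean imputation on a fresh dataframe. Since the permits data has no entirely empty column, the sketch below uses a small hypothetical frame (`raw`, including its `Empty Column`) to show how those steps could look; it is an illustration under these assumptions, not the result for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one entirely empty column
raw = pd.DataFrame({
    "Estimated Cost": [np.nan, 4000.0, np.nan, 2000.0],
    "Unit Suffix": ["A", None, None, "B"],
    "Empty Column": [np.nan] * 4,
})

# Drop only the columns in which every value is missing
trimmed = raw.dropna(axis=1, how="all")
dropped_columns = raw.columns.difference(trimmed.columns).tolist()

# Forward fill first, then backward fill to cover leading gaps
filled = trimmed.ffill().bfill()

# Mean imputation on a fresh copy, numeric columns only
mean_imputed = trimmed.copy()
numeric_cols = mean_imputed.select_dtypes(include="number").columns
mean_imputed[numeric_cols] = mean_imputed[numeric_cols].fillna(
    mean_imputed[numeric_cols].mean()
)

print(dropped_columns)
print(filled)
print(mean_imputed)
```

Working on copies keeps the original frame untouched, so the two imputation strategies can be compared side by side.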

#### Q.6) Calculate the mean and median for the estimated cost and revised cost. What is your observation?

In [60]:
print('Mean of Estimated Cost:', df['Estimated Cost'].mean())
print('Median of Estimated Cost:', df['Estimated Cost'].median())

Mean of Estimated Cost: 168955.44329681536
Median of Estimated Cost: 20000.0


In [61]:
print('Mean of Revised Cost:', df['Revised Cost'].mean())
print('Median of Revised Cost:', df['Revised Cost'].median())

Mean of Revised Cost: 132856.1864917494
Median of Revised Cost: 8000.0


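For both columns the mean is far larger than the median (168955 vs. 20000 for Estimated Cost, 132856 vs. 8000 for Revised Cost). This indicates that both distributions are strongly right-skewed: a small number of very expensive permits pull the mean up, so the median is the better summary of a typical permit cost. Both the mean and the median of Revised Cost are lower than those of Estimated Cost, but the two columns have different numbers of missing values (38066 vs. 6066), so this comparison should be read with caution.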
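
A mean far above the median is the usual signature of right-skewed data. The toy example below (the `costs` values are made up for illustration) shows how a single very large value moves the mean while leaving the median almost untouched.

```python
import pandas as pd

# Hypothetical costs: four ordinary values and one extreme outlier
costs = pd.Series([1_000, 2_000, 3_000, 4_000, 1_000_000], name="cost")

mean_cost = costs.mean()      # pulled up by the outlier
median_cost = costs.median()  # middle value, robust to the outlier
skewness = costs.skew()       # positive for a right-skewed sample

print(mean_cost, median_cost, skewness)
```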

#### Q.7) Create a new cleaned dataframe, after applying backward/forward fill, with two columns only, namely Revised Cost and Estimated Cost. Transform the dataframe using the following normalization techniques: min-max scaling and z-score.

In [62]:
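# if df was already mean-imputed in Q.5, these columns have no NaNs left and the fills below are no-ops
new_df = df[['Estimated Cost', 'Revised Cost']].copy()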

In [63]:
new_df = new_df.ffill()
new_df = new_df.bfill()

In [64]:
# min-max
mms_df = (new_df - new_df.min()) / (new_df.max() - new_df.min())

# z-score
zs_df = (new_df - new_df.mean()) / new_df.std()

In [65]:
mms_df.head(5)

|   | Estimated Cost | Revised Cost |
|---|---|---|
| 0 | 7e-06 | 5.12492e-06 |
| 1 | 0.0 | 6.40615e-07 |
| 2 | 3.7e-05 | 0.0001702193 |
| 3 | 4e-06 | 2.56246e-06 |
| 4 | 0.000186 | 0.000128123 |


In [66]:
zs_df.head(5)

|   | Estimated Cost | Revised Cost |
|---|---|---|
| 0 | -0.050529 | -0.0365051 |
| 1 | -0.051754 | -0.03749665 |
| 2 | -0.045628 | -4.592547e-15 |
| 3 | -0.051142 | -0.0370717 |
| 4 | -0.021122 | -0.009308194 |
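
As a sanity check on the two transformations, min-max scaling should map each column onto the range [0, 1], and z-scoring should give each column a mean of 0 and a standard deviation of 1. The sketch below wraps the same formulas in two helper functions and checks these properties on made-up numbers; `min_max_scale`, `z_score`, and the `costs` frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1] using its own min and max."""
    return (frame - frame.min()) / (frame.max() - frame.min())

def z_score(frame: pd.DataFrame) -> pd.DataFrame:
    """Centre each column on 0 and divide by its sample std (ddof=1)."""
    return (frame - frame.mean()) / frame.std()

# Hypothetical cost values
costs = pd.DataFrame({
    "Estimated Cost": [1000.0, 2000.0, 3000.0, 10000.0],
    "Revised Cost": [500.0, 2500.0, 3000.0, 12000.0],
})

scaled = min_max_scale(costs)
standardized = z_score(costs)

print(scaled.min().tolist(), scaled.max().tolist())
print(np.allclose(standardized.mean(), 0.0), np.allclose(standardized.std(), 1.0))
```

Both formulas divide by a per-column spread, so a constant column would produce a division by zero and needs separate handling. The very small min-max values in the output above are consistent with a few extreme costs dominating the range.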
