#### About the Analysis:

##### The main purpose of this notebook is to explore features of the LIHTC (Low Income Housing Tax Credit) program data for Florida and compare them with the rest of the USA.

##### Before doing that, we load the NHPD data and check the overall structure of the dataset. Many features turn out to have around 80-100% missing values (NAs).

##### We then look at the top 10 cities, counties and states in the dataset and run a small univariate and bivariate analysis of features such as occupancy rate and fair market rent, before moving on to the LIHTC program columns for Florida and comparing them with the rest of the USA.

#### Findings:

##### 1. We did not see a strong trend when comparing the occupancy rate with the fair market rent, but we did find some clusters suggesting that the occupancy rate is high when the fair market rent is low.

##### 2. For Florida, around 120 LIHTC assisted units are covered by subsidies on average (averaged per city), whereas for the rest of the USA it is around 60.

##### 3. On average there are just 2-4 active LIHTC subsidies per city in Florida, and the per-city numbers for the rest of the USA are almost the same.

In [338]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
nhpd = pd.read_excel('Active and Inconclusive Properties.xlsx')

In [None]:
nhpd.shape

In [None]:
# Let's find the percent of missing values in each column. 

def find_missing(df: pd.DataFrame):
    """
    Calculates the percent of missing values in a given dataset
    param df: The dataframe for which we want to calculate the percent of missing values in each column
    """
    
    missing = df.isnull().sum()
    
    percent_missing = missing * 100/ len(df)
    
    missing_df = pd.DataFrame({'col': df.columns, 'percent_missing': percent_missing})
    
    #sort them in decreasing order
    missing_df.sort_values(by = 'percent_missing', ascending= False, inplace=True)
    
    return missing_df
    
    

missing_values = find_missing(nhpd)


In [None]:
# number of columns with at least 90% missing data
df_subset_90 = missing_values[missing_values['percent_missing'] >= 90]
df_subset_90['col'].nunique()

In [None]:
# number of columns with at most 30% missing data
df_subset_30 = missing_values[missing_values['percent_missing'] <= 30]
df_subset_30['col'].nunique()

In [None]:
# number of columns that are entirely missing (100% missing data)
df_subset_100 = missing_values[missing_values['percent_missing'] == 100]
df_subset_100['col'].nunique()
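##### The missing-value counts above can also be computed more compactly. The following is a minimal sketch on a hypothetical toy frame (the `toy` data, the `AllMissing` column and the bin edges are made up for illustration, not taken from the NHPD file). It relies on the fact that the column mean of `isna()` is the fraction of missing values, and uses `pd.cut` to bucket the columns in one step. Note that the buckets here are right-inclusive, so a column with exactly 90% missing values lands in the middle bucket, unlike the `>= 90` filter used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the NHPD data
toy = pd.DataFrame({
    "City": ["Miami", "Tampa", None, "Orlando"],
    "OccupancyRate": [95.0, np.nan, np.nan, 90.0],
    "AllMissing": [np.nan, np.nan, np.nan, np.nan],
})

# isna() yields booleans; their column mean is the fraction of missing values
percent_missing = (toy.isna().mean() * 100).sort_values(ascending=False)

# Bucket the columns by how incomplete they are (illustrative bin edges)
buckets = pd.cut(
    percent_missing,
    bins=[-0.1, 30, 90, 100],
    labels=["<=30%", "30-90%", ">90%"],
)

# Number of columns per bucket
bucket_counts = buckets.value_counts()
print(percent_missing)
print(bucket_counts)
```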

In [None]:
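# keep the columns with at most 30% missing data (df_subset_90 would select the mostly empty columns instead)
no_missing_list = df_subset_30['col'].unique().tolist()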

In [None]:
nhpd_subset = nhpd.loc[:, nhpd.columns.isin(no_missing_list)]
nhpd_subset.columns

#### Analyzing some important features of the overall data, such as occupancy rate and fair market rent, to get some initial insights

In [None]:
def univariate_analysis(df: pd.DataFrame, col: str):
    """
    Takes a dataframe and a column name and plots the ten most frequent values of that column as a horizontal bar chart
    param df: dataframe
    param col: column name for which we want the viz
    """
    
    df[col].value_counts().head(10).sort_values(ascending=True).plot.barh()    

In [None]:
# Top ten cities present in the dataset
univariate_analysis(nhpd, 'City')

In [None]:
# Top ten counties present in the dataset
univariate_analysis(nhpd, 'County')

In [None]:
# Top ten states present in the dataset
univariate_analysis(nhpd, 'State')

#### Occupancy rate vs. fair market rent: we did not see a strong trend when comparing the occupancy rate with the fair market rent, but we did find some clusters suggesting that the occupancy rate is high when the fair market rent is low.
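##### A correlation coefficient is one way to put a number on how weak or strong such a trend is. The sketch below uses made-up values in a hypothetical `toy` frame (only the column names come from the dataset). It computes the Pearson correlation and the Spearman rank correlation, the latter as the Pearson correlation of the ranks so that no extra dependency is needed.

```python
import pandas as pd

# Hypothetical values; only the column names match the NHPD data
toy = pd.DataFrame({
    "FairMarketRent_2BR": [700, 850, 900, 1100, 1300, 1600],
    "OccupancyRate": [97, 96, 98, 93, 94, 90],
})

rent = toy["FairMarketRent_2BR"]
occupancy = toy["OccupancyRate"]

# Pearson: strength of the linear relationship
pearson = rent.corr(occupancy)

# Spearman: Pearson correlation of the ranks, i.e. strength of any monotonic relationship
spearman = rent.rank().corr(occupancy.rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

##### On the real data the same two calls would be applied to `nhpd["FairMarketRent_2BR"]` and `nhpd["OccupancyRate"]`; `Series.corr` skips pairs in which either value is missing.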

In [None]:
sns.regplot(data=nhpd, x="FairMarketRent_2BR", y="OccupancyRate")

In [None]:
sns.regplot(data=nhpd, x="FairMarketRent_2BR", y="AverageMonthsOfTenancy")

##### Next, we look at how the fair market rent varies by target tenant type:

In [None]:
from fuzzywuzzy import fuzz, process

def match_and_replace(df, column, string_to_match, min_ratio = 50):
    """
    Replaces, in place, the values of `column` that fuzzily match `string_to_match` with that string
    """
    # unique non-null values to compare against
    unique_strings = df[column].dropna().unique()
    # up to 20 closest candidates, returned as (candidate, score) tuples
    candidates = process.extract(string_to_match, unique_strings, 
                                 limit=20, scorer=fuzz.token_sort_ratio)
    # keep only the candidates whose score reaches the threshold
    matches = [candidate for candidate, score in candidates if score >= min_ratio]

    rows_with_matches = df[column].isin(matches)
    df.loc[rows_with_matches, column] = string_to_match
    print("Replaced!")
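##### To see how a similarity threshold of 50 behaves, here is a rough stand-in for token-sort matching built on the standard library's `difflib` (this is not fuzzywuzzy's implementation; the `similarity` helper and the raw labels are hypothetical). It lower-cases both strings, sorts their tokens, and scores the result on a 0-100 scale.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 100] computed after sorting the tokens."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return 100 * SequenceMatcher(None, norm_a, norm_b).ratio()

canonical = "Elderly or Disabled"
# Hypothetical raw values such as might appear in TargetTenantType
labels = ["elderly or disabled", "Disabled or Elderly", "Family", "Elderly"]

scores = {label: round(similarity(canonical, label)) for label in labels}
# Labels that would be rewritten to the canonical string at a threshold of 50
matched = [label for label, score in scores.items() if score >= 50]

print(scores)
print(matched)
```

##### In this sketch the short label "Elderly" also clears the threshold for "Elderly or Disabled", because it is fully contained in the longer string. With a low cutoff, the order of the replacement calls can therefore change which category a row ends up in, so it is worth inspecting the matched values before replacing them.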

In [None]:
match_and_replace(nhpd, 'TargetTenantType', 'Mixed')
match_and_replace(nhpd, 'TargetTenantType', 'Mixed;Link')
match_and_replace(nhpd, 'TargetTenantType', 'Elderly or Disabled')
match_and_replace(nhpd, 'TargetTenantType', 'Family')
match_and_replace(nhpd, 'TargetTenantType', 'Elderly')
fig, ax = plt.subplots(figsize=(12,12))
ax = sns.boxplot(y="TargetTenantType", x="FairMarketRent_2BR", data=nhpd)

#### Let's explore the LIHTC (Low Income Housing Tax Credit) housing subsidies for Florida and compare them with the rest of the USA
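##### The comparison can also be produced side by side from a single frame. This is a minimal sketch on made-up numbers (the `toy` frame and the `Region` column are hypothetical; only `State`, `City` and `LIHTC_1_AssistedUnits` are real column names). It groups by state and city rather than by city alone, on the assumption that city names can repeat across states.

```python
import pandas as pd

# Hypothetical rows; the values are illustrative only
toy = pd.DataFrame({
    "State": ["FL", "FL", "FL", "GA", "GA", "TX"],
    "City": ["Miami", "Miami", "Tampa", "Atlanta", "Atlanta", "Austin"],
    "LIHTC_1_AssistedUnits": [100, 140, 120, 50, 70, 60],
})

# Label each property as Florida or rest of the US
toy["Region"] = toy["State"].eq("FL").map({True: "Florida", False: "Rest of US"})

# Mean assisted units per city; State is part of the key so that
# same-named cities in different states are not merged
city_means = toy.groupby(["Region", "State", "City"])["LIHTC_1_AssistedUnits"].mean()

# Summary statistics of the per-city means, one row per region
summary = city_means.groupby(level="Region").describe()
print(summary)
```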

In [None]:
# Let's explore the Florida data
florida_df = nhpd[nhpd['State'] == 'FL']
florida_df.head()

In [None]:
# Get the rest of US data for comparison with Florida data
rest_US = nhpd[nhpd['State'] != 'FL']
rest_US.head()

In [None]:
# Select the LIHTC-related columns for Florida

LIHTC_florida = florida_df[['City', 'LIHTC_1_AssistedUnits', 'NumberActiveLihtc', 'NumberInconclusiveLihtc', 
                            'NumberInactiveLihtc', 'LIHTC_2_AssistedUnits', 'LIHTC_1_ProgramName', 
                            'TargetTenantType', 'FairMarketRent_2BR']]


# Drop rows with missing values; reassigning (instead of inplace=True) avoids a SettingWithCopyWarning on the slice
LIHTC_florida = LIHTC_florida.dropna()
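##### One thing to keep in mind: `dropna()` without arguments removes every row that has a missing value in any of the selected columns, so a property with no second LIHTC subsidy (a missing `LIHTC_2_AssistedUnits`) is dropped as well. The sketch below, on a hypothetical toy frame, contrasts that with restricting the check to the columns an analysis actually uses via the `subset` parameter. Which behavior is appropriate for the NHPD data is a judgment call that this notebook does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: only Tampa has a second LIHTC subsidy
toy = pd.DataFrame({
    "City": ["Miami", "Tampa", "Orlando"],
    "LIHTC_1_AssistedUnits": [100.0, 80.0, np.nan],
    "LIHTC_2_AssistedUnits": [np.nan, 40.0, np.nan],
})

# Drops any row with a missing value in any column
strict = toy.dropna()

# Drops only rows missing the columns needed for the per-city average
targeted = toy.dropna(subset=["City", "LIHTC_1_AssistedUnits"])

print(len(strict), len(targeted))
```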

In [None]:
# Average number of LIHTC assisted units covered by subsidies, per city in Florida
fig, ax = plt.subplots(figsize=(10,8))
LIHTC_florida.groupby('City')['LIHTC_1_AssistedUnits'].mean().sort_values().plot.barh(ax=ax)

###### For Florida we can see that around 120 LIHTC assisted units are covered by subsidies on average

In [None]:
LIHTC_florida.groupby('City')['LIHTC_1_AssistedUnits'].mean().describe()

###### For the rest of the US, around 60 LIHTC assisted units are covered by subsidies on average

In [None]:
# Select the same LIHTC-related columns for the rest of the US

LIHTC_rest_US = rest_US[['City', 'LIHTC_1_AssistedUnits', 'LIHTC_2_AssistedUnits', 
                         'NumberActiveLihtc', 'NumberInconclusiveLihtc', 
                        'NumberInactiveLihtc', 'LIHTC_1_ProgramName', 
                         'TargetTenantType', 'FairMarketRent_2BR']]


# Drop rows with missing values; reassigning (instead of inplace=True) avoids a SettingWithCopyWarning on the slice
LIHTC_rest_US = LIHTC_rest_US.dropna()

LIHTC_rest_US.groupby('City')['LIHTC_1_AssistedUnits'].mean().describe()
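##### The averages quoted here are means of per-city means, which weight every city equally regardless of how many properties it has. As a small illustration with made-up numbers (the `toy` frame is hypothetical), this can differ noticeably from the average over all properties:

```python
import pandas as pd

# Hypothetical rows: three small properties in one city, one large property in another
toy = pd.DataFrame({
    "City": ["Miami", "Miami", "Miami", "Tampa"],
    "LIHTC_1_AssistedUnits": [50, 60, 70, 200],
})

# Each city counts once, however many properties it has
mean_of_city_means = toy.groupby("City")["LIHTC_1_AssistedUnits"].mean().mean()

# Each property counts once
property_level_mean = toy["LIHTC_1_AssistedUnits"].mean()

print(mean_of_city_means, property_level_mean)
```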

###### Number of active LIHTC subsidies in Florida, grouped by city: on average there are just 2-4 active LIHTC subsidies per city in Florida

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
LIHTC_florida.groupby('City')['NumberActiveLihtc'].mean().sort_values().plot.barh()

In [None]:
LIHTC_florida.groupby('City')['NumberActiveLihtc'].mean().describe()

###### Number of active LIHTC subsidies for the rest of the US, grouped by city: on average there are just 2-5 active LIHTC subsidies per city across the rest of the USA

In [None]:
LIHTC_rest_US.groupby('City')['NumberActiveLihtc'].mean().describe()