## Feature Engineering

## Import Necessary Libraries and Data

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
df = pd.read_csv("../data_cleaning/final_data.csv")

# Set the maximum number of rows to display
pd.set_option('display.max_rows', 100)

# DataFrames with up to 100 rows now print in full; larger ones (like this one) are still truncated
print(df)

                  Variable/Indicator          Data                  Sector
0                 adult_samehome_nyc  8.946070e+01            DEMOGRAPHICS
1                  asian_api_pop_nyc  1.410903e+01            DEMOGRAPHICS
2                asian_api_total_nyc  1.184982e+06            DEMOGRAPHICS
3       asian_api_pop_nyc_historical  9.830000e+00            DEMOGRAPHICS
4           asian_api_pop_change_nyc  4.353033e+01            DEMOGRAPHICS
..                               ...           ...                     ...
343  veterans_unemployed_percent_nyc  3.925968e+00  WORK, WEALTH & POVERTY
344  unpaid_family_workers_class_nyc  1.046233e-01  WORK, WEALTH & POVERTY
345             veterans_poverty_nyc  1.598600e+04  WORK, WEALTH & POVERTY
346     veterans_poverty_percent_nyc  1.112937e+01  WORK, WEALTH & POVERTY
347     government_workers_class_nyc  1.303072e+01  WORK, WEALTH & POVERTY

[348 rows x 3 columns]


## Our data has 3 columns, and each must be converted before a model can use it
1. Vectorize the Variable/Indicator column (text)
2. Normalize the Data column (numeric values)
3. One-hot encode the Sector column (category)
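
Step 2 is not shown in the cells of this section. As a minimal sketch, assuming the Data column mixes percentages with raw counts (as the printed rows suggest), one option is a signed log transform followed by z-scoring. The toy frame and the Data_normalized column name are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Data" column (values copied from the printed rows above)
toy = pd.DataFrame({"Data": [89.4607, 14.10903, 1184982.0, 9.83, 43.53033]})

# Signed log compresses the huge range (percentages next to raw counts)
# and still handles negative values, should any exist
log_scaled = np.sign(toy["Data"]) * np.log1p(toy["Data"].abs())

# Z-score: zero mean and unit (population) standard deviation
toy["Data_normalized"] = (log_scaled - log_scaled.mean()) / log_scaled.std(ddof=0)
print(toy)
```

Whether a log transform is appropriate depends on how the values are distributed in the full file, so treat this as a starting point rather than the intended method.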

## 1. Converting the Variable/Indicator column into a TF-IDF matrix

What is Text Vectorization?
In simple terms, text vectorization is the process of turning text into numbers that a computer can process. Imagine feeding words or sentences into a calculator: it wouldn't know what to do with them. Convert those words into numbers, though, and the calculator can work with them. That is essentially what text vectorization does.

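Benefits:
1. Machine compatibility: models receive numeric input instead of raw strings
2. Term weighting: TF-IDF gives more weight to words that are distinctive for a row and less to words shared by most rows
3. Sparse storage: the result is a sparse matrix, which stays compact even when the vocabulary is large

Note that TF-IDF matches words literally. It does not capture semantic meaning, and it adds one column per vocabulary term rather than reducing dimensionality.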
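
To make the idea of "text to numbers" concrete, here is a minimal sketch that builds plain word-count vectors by hand. The two phrases are hypothetical, not rows from final_data.csv. TF-IDF starts from counts like these and then down-weights words that appear in many documents.

```python
# Hypothetical phrases, not rows from final_data.csv
docs = ["veterans poverty percent", "veterans unemployed percent"]

# Vocabulary: one column per distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of word counts over that vocabulary
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
print(vectors)
```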

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
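# Initialize the vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform: the vectorizer learns a vocabulary from the variable names
# and returns one TF-IDF row per name.
# Note: the default tokenizer treats "_" as a word character, so each
# underscore-joined name is kept as a single token.
variables_tfidf_matrix = vectorizer.fit_transform(df['Variable/Indicator'])

# variables_tfidf_matrix is a sparse matrix (one row per variable, one column
# per learned token) that can be passed directly to a model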

## 3. Performing One-Hot Encoding on Sector

One-hot encoding converts a categorical variable into a set of binary columns, one per category, so that machine learning algorithms that expect numeric input can use it.

In [25]:
# Figure out column names to identify which one is the sector column
df_cols = df.columns.tolist() # 'Sector' is the name of our sector column

# Identify Unique Sectors
unique_sectors = df['Sector'].unique()
print(unique_sectors)



['DEMOGRAPHICS' 'EDUCATION' 'ENVIRONMENT' 'FOOD SYSTEMS' 'HEALTH'
 'HOUSING & INFRASTRUCTURE' 'POLITICAL ENGAGEMENT'
 'PUBLIC FUNDING & SERVICES' 'SAFETY & SECURITY' 'WORK, WEALTH & POVERTY']


We have 10 sectors, so with drop='first' our one-hot encoding should return 9 columns. The encoder drops the first category, 'DEMOGRAPHICS', to avoid collinearity.

- A 1 in one of the 9 sector columns means the row belongs to that sector

- A 0 means the row does not belong to that sector

- A row with 0 in all 9 columns belongs to the dropped sector, 'DEMOGRAPHICS'

We will use this to cluster our data
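
The effect of dropping the first category can be seen on a toy column. This sketch uses pandas get_dummies with drop_first=True as a stand-in for the OneHotEncoder used in the notebook; the four-row frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Sector": ["DEMOGRAPHICS", "EDUCATION", "HEALTH", "EDUCATION"]})

# drop_first=True removes the first category in sorted order (DEMOGRAPHICS here)
dummies = pd.get_dummies(toy["Sector"], prefix="Sector", drop_first=True).astype(int)
print(dummies)

# A row of all zeros marks the dropped category
is_dropped = (dummies.sum(axis=1) == 0).tolist()
print(is_dropped)
```

For distance-based clustering, note that the dropped category sits at the all-zeros point, so it is closer to every other sector than those sectors are to each other. Keeping all 10 columns may be a reasonable alternative in that setting.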


In [26]:
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder
# sparse_output=False - return a regular NumPy array instead of a sparse matrix
#                       (this parameter name requires scikit-learn 1.2 or newer)
# drop="first" - drop the first category to avoid perfect collinearity
#                among the encoded columns.
#                ex : We have Group A, B, C. We drop A,
#                     so a row with B = 0 and C = 0
#                     must be A by elimination
# Benefit: linear models with an intercept no longer get a redundant column,
#          which keeps their coefficients uniquely determined

encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit the encoder and transform the Sector column
sector_encoded = encoder.fit_transform(df[['Sector']])


# Wrap the result in a DataFrame so it is easy to inspect
encoded_df = pd.DataFrame(sector_encoded, columns=encoder.get_feature_names_out(['Sector']))


In [10]:
# Visualize one hot encoding
print(encoded_df)

     Sector_EDUCATION  Sector_ENVIRONMENT  Sector_FOOD SYSTEMS  Sector_HEALTH  \
0                 0.0                 0.0                  0.0            0.0   
1                 0.0                 0.0                  0.0            0.0   
2                 0.0                 0.0                  0.0            0.0   
3                 0.0                 0.0                  0.0            0.0   
4                 0.0                 0.0                  0.0            0.0   
..                ...                 ...                  ...            ...   
343               0.0                 0.0                  0.0            0.0   
344               0.0                 0.0                  0.0            0.0   
345               0.0                 0.0                  0.0            0.0   
346               0.0                 0.0                  0.0            0.0   
347               0.0                 0.0                  0.0            0.0   

     Sector_HOUSING & INFRA

In [15]:
# List the one-hot-encoded column names
column_names = encoded_df.columns.tolist()
print(column_names)

['Sector_EDUCATION', 'Sector_ENVIRONMENT', 'Sector_FOOD SYSTEMS', 'Sector_HEALTH', 'Sector_HOUSING & INFRASTRUCTURE', 'Sector_POLITICAL ENGAGEMENT', 'Sector_PUBLIC FUNDING & SERVICES', 'Sector_SAFETY & SECURITY', 'Sector_WORK, WEALTH & POVERTY']


In [16]:
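# Export the encoded sectors to Excel.
# pandas uses openpyxl as its .xlsx engine, so the package must be installed,
# but it does not need to be imported here. The DataFrame's own column names
# are written as the header row by default.
encoded_df.to_excel('sector_one_hot_encoded.xlsx', index=False)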
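
Once all three columns are converted, the resulting blocks can be placed side by side to form one feature matrix for clustering. The sketch below assumes small hypothetical stand-ins for each block. In the notebook they would correspond to variables_tfidf_matrix, a normalized Data column, and sector_encoded.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins with 3 rows each
tfidf_block = sparse.csr_matrix(
    np.array([[0.0, 0.7, 0.7], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
)
value_block = np.array([[0.5], [-1.2], [0.7]])
sector_block = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# hstack places the blocks side by side: same rows, columns concatenated
features = sparse.hstack(
    [tfidf_block, sparse.csr_matrix(value_block), sparse.csr_matrix(sector_block)]
).tocsr()

print(features.shape)
```

Keeping the result sparse avoids materializing the TF-IDF block as a dense array, which matters more as the vocabulary grows.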