### Data Pre-Processing


1. **Collecting the data -** data consists of budget text documents in the form of PDF files obtained from the following organizations: 

   * [Guilford County](https://www.guilfordcountync.gov/our-county/budget-management-evaluation)
   * [Durham County](https://www.dconc.gov/government/departments-a-e/budget-management-services)
   * [City of Durham](https://durhamnc.gov/199/Budget-Management-Services)
   * [City of Charlotte](https://charlottenc.gov/budget/Pages/default.aspx)
   * [Mecklenburg County](https://www.mecknc.gov/CountyManagersOffice/OMB/Pages/Home.aspx)
   * [Wake County](http://www.wakegov.com/budget/Pages/default.aspx)
   * [City of Raleigh](https://www.raleighnc.gov/home/content/Departments/Articles/BudgetManagement.html)
   
After the PDF files are collected, they are compressed to reduce the their sizes. Then, the files are tokenized, and converted into CSV files using an app developed by project mentor:
           **[Jason Jones](https://www.linkedin.com/in/jones-jason-adam/),**
           **click [here](https://jason-jones.shinyapps.io/Emotionizer/) for the App**
       
2. **Cleaning the data -** performing some popular text pre-processing techniques


3. **Organizing the data -** organizing the cleaned data into a way that is easy to input into other algorithms

In [1]:
import os
import glob
import nltk
import pandas as pd
import numpy as np

In [2]:
# Change the current directory to read the data
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData") 

### 1- Obtaining the data

#### Transforming the csv files into dataframes

In [3]:
# Reading FY13-FY20 data files into pandas dataframes
FY13_df = pd.read_csv(r'PreprocessedOriginalDataFY13.csv', engine='python')
FY14_df = pd.read_csv(r'PreprocessedOriginalDataFY14.csv', engine='python')
FY15_df = pd.read_csv(r'PreprocessedOriginalDataFY15.csv', engine='python')
FY16_df = pd.read_csv(r'PreprocessedOriginalDataFY16.csv', engine='python')
FY17_df = pd.read_csv(r'PreprocessedOriginalDataFY17.csv', engine='python')
FY18_df = pd.read_csv(r'PreprocessedOriginalDataFY18.csv', engine='python')
FY19_df = pd.read_csv(r'PreprocessedOriginalDataFY19.csv', engine='python')
FY20_df = pd.read_csv(r'PreprocessedOriginalDataFY20.csv', engine='python')

In [4]:
# Combine all dataframes into a single dataframe using concat() function
# Row lables are adjusted automaticlly by passing ignore_index=True

df =  pd.concat([FY13_df, FY14_df, FY15_df, FY16_df, FY17_df, 
                  FY18_df, FY19_df, FY20_df], ignore_index=True)
df.head()

Unnamed: 0,ï»¿page_number,word,organization,year
0,3,fiscal,Guilford County,FY2013
1,3,year,Guilford County,FY2013
2,3,adopted,Guilford County,FY2013
3,3,budget,Guilford County,FY2013
4,3,brenda,Guilford County,FY2013


### 2- Cleaning the Data

In [5]:
# listing columns in data frame 
list(df)

['ï»¿page_number', 'word', 'organization', 'year']

#### Dropping and reordering columns

In [6]:
# delete columns using the columns parameter of drop
df = df.drop(columns="ï»¿page_number")
df.head()

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013


#### Removing stop words

In [7]:
# Import stop words from nltk 
from nltk.corpus import stopwords
# Define variable stop
stop = stopwords.words('english')

df['word'] = df['word'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df.head()

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013


#### Lowercasing

In [8]:
df['word'] = df['word'].str.lower()
df.head()

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013


#### Stemming

In [9]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

In [10]:
df['text_lemmatized'] = df.word.apply(lemmatize_text)

#### Removing punctuations

In [11]:
df['word'] = df['word'].str.replace('[^\w\s]','')
df['text_lemmatized'] = df['text_lemmatized'].str.replace('[^\w\s]','')
df.head(50)   

Unnamed: 0,word,organization,year,text_lemmatized
0,fiscal,Guilford County,FY2013,
1,year,Guilford County,FY2013,
2,adopted,Guilford County,FY2013,
3,budget,Guilford County,FY2013,
4,brenda,Guilford County,FY2013,
5,jones,Guilford County,FY2013,
6,fox,Guilford County,FY2013,
7,county,Guilford County,FY2013,
8,manager,Guilford County,FY2013,
9,sharisse,Guilford County,FY2013,


#### Handling missing text data 

In [12]:
# Replace any empty strings in the 'word' column with np.nan objects
df['word'].replace('', np.nan, inplace=True)

# Drop all NaN values
df.dropna(subset=['word'], inplace=True)
df.head(30)

Unnamed: 0,word,organization,year,text_lemmatized
0,fiscal,Guilford County,FY2013,
1,year,Guilford County,FY2013,
2,adopted,Guilford County,FY2013,
3,budget,Guilford County,FY2013,
4,brenda,Guilford County,FY2013,
5,jones,Guilford County,FY2013,
6,fox,Guilford County,FY2013,
7,county,Guilford County,FY2013,
8,manager,Guilford County,FY2013,
9,sharisse,Guilford County,FY2013,


### 2- Organizing the Data

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [None]:
# Let's take a look at our dataframe
df.head()

#### Dataframe to one single and clean csv file 

In [None]:
# Change the dirctory for file to be exported to the proper folder
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData") 

# Export dataframe to csv
df.to_csv(r"CombinedData.csv", index=False, encoding='utf-8-sig')

In [None]:
# Change the dirctory for file to be exported to the proper folder
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData\pickle") 

# Let's pickle it for later use
df.to_pickle("CombinedData.pkl")