### Data Pre-Processing


1. **Collecting the data -** the data consists of budget documents published as PDF files by the following organizations:

   * [Guilford County](https://www.guilfordcountync.gov/our-county/budget-management-evaluation)
   * [Durham County](https://www.dconc.gov/government/departments-a-e/budget-management-services)
   * [City of Durham](https://durhamnc.gov/199/Budget-Management-Services)
   * [City of Charlotte](https://charlottenc.gov/budget/Pages/default.aspx)
   * [Mecklenburg County](https://www.mecknc.gov/CountyManagersOffice/OMB/Pages/Home.aspx)
   * [Wake County](http://www.wakegov.com/budget/Pages/default.aspx)
   * [City of Raleigh](https://www.raleighnc.gov/home/content/Departments/Articles/BudgetManagement.html)
   
   After the PDF files are collected, they are compressed to reduce their size. The files are then tokenized and converted into CSV files using an [app](https://jason-jones.shinyapps.io/Emotionizer/) developed by project mentor **[Jason Jones](https://www.linkedin.com/in/jones-jason-adam/)**.
       
2. **Cleaning the data -** applying common text pre-processing techniques (dropping unused columns, stripping punctuation, and removing empty tokens)


3. **Organizing the data -** organizing the cleaned data in a form that is easy to feed into other algorithms
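
The tokenization in step 1 is done by the external app, whose internals are not shown here. As a rough illustration only, the sketch below shows how page text could be turned into the one-word-per-row layout used in the rest of this notebook. The function name `tokenize_pages`, the `\w+` tokenizer, and the sample pages are all assumptions made for the example, not the app's actual method.

```python
import re
import pandas as pd

def tokenize_pages(pages, organization, year):
    """Hypothetical helper: one row per lowercase word token, tagged with its page."""
    rows = []
    for page_number, text in enumerate(pages, start=1):
        # \w+ keeps runs of letters, digits and underscores; everything else splits tokens
        for word in re.findall(r"\w+", text.lower()):
            rows.append({"page_number": page_number, "word": word,
                         "organization": organization, "year": year})
    return pd.DataFrame(rows, columns=["page_number", "word", "organization", "year"])

# Made-up page text, used only to show the output shape
sample_pages = ["Fiscal Year 2013 Adopted Budget", "General Fund summary"]
tokens = tokenize_pages(sample_pages, "Guilford County", "FY2013")
print(tokens)
```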

In [1]:
import os
import glob
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Change the current directory to read the data
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData") 

### 1- Obtaining the data

#### Reading the csv files into dataframes

In [3]:
# Reading FY13-FY20 data files into pandas dataframes
FY13_df = pd.read_csv(r'PreprocessedOriginalDataFY13.csv', engine='python')
FY14_df = pd.read_csv(r'PreprocessedOriginalDataFY14.csv', engine='python')
FY15_df = pd.read_csv(r'PreprocessedOriginalDataFY15.csv', engine='python')
FY16_df = pd.read_csv(r'PreprocessedOriginalDataFY16.csv', engine='python')
FY17_df = pd.read_csv(r'PreprocessedOriginalDataFY17.csv', engine='python')
FY18_df = pd.read_csv(r'PreprocessedOriginalDataFY18.csv', engine='python')
FY19_df = pd.read_csv(r'PreprocessedOriginalDataFY19.csv', engine='python')
FY20_df = pd.read_csv(r'PreprocessedOriginalDataFY20.csv', engine='python')
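
The eight near-identical `read_csv` calls could also be written as a loop. The sketch below is one possible way to do that, assuming the files keep the `PreprocessedOriginalDataFY<yy>.csv` naming pattern shown above. The helper name `load_fiscal_years` and its parameters are hypothetical.

```python
from pathlib import Path
import pandas as pd

def load_fiscal_years(data_dir, first_year=13, last_year=20):
    """Hypothetical helper: read one CSV per fiscal year into a dict keyed by 'FY13', 'FY14', ..."""
    frames = {}
    for yy in range(first_year, last_year + 1):
        label = f"FY{yy}"
        path = Path(data_dir) / f"PreprocessedOriginalData{label}.csv"
        frames[label] = pd.read_csv(path, engine="python")
    return frames
```

With the files in the current directory, `frames = load_fiscal_years(".")` followed by `pd.concat(list(frames.values()), ignore_index=True)` would build the same combined dataframe without naming each year by hand.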

In [4]:
# Combine all dataframes into a single dataframe using the concat() function
# Row labels are renumbered automatically by passing ignore_index=True
data = pd.concat([FY13_df, FY14_df, FY15_df, FY16_df, FY17_df,
                  FY18_df, FY19_df, FY20_df], ignore_index=True)
data

Unnamed: 0,ï»¿page_number,word,organization,year
0,3,fiscal,Guilford County,FY2013
1,3,year,Guilford County,FY2013
2,3,adopted,Guilford County,FY2013
3,3,budget,Guilford County,FY2013
4,3,brenda,Guilford County,FY2013
...,...,...,...,...
4898105,498,index,Wake County,FY2020
4898106,498,fiscal,Wake County,FY2020
4898107,498,year,Wake County,FY2020
4898108,498,adopted,Wake County,FY2020
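
To see what `ignore_index=True` changes, here is a small sketch on two made-up frames (the sample values are illustrative, not taken from the real files). Without the flag each frame keeps its own row labels, so labels repeat. With it, the combined frame gets a fresh 0..n-1 index.

```python
import pandas as pd

fy13 = pd.DataFrame({"word": ["fiscal", "year"], "year": "FY2013"})
fy14 = pd.DataFrame({"word": ["adopted", "budget"], "year": "FY2014"})

kept = pd.concat([fy13, fy14])                             # row labels: 0, 1, 0, 1
renumbered = pd.concat([fy13, fy14], ignore_index=True)    # row labels: 0, 1, 2, 3

print(list(kept.index))
print(list(renumbered.index))
```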


### 2- Cleaning the Data

In [5]:
# listing columns in data frame 
list(data)

['ï»¿page_number', 'word', 'organization', 'year']
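
The `ï»¿` prefix on the first column name is what a UTF-8 byte-order mark (bytes `EF BB BF`) looks like when it is decoded with a single-byte encoding such as cp1252. The sketch below reproduces the effect on a tiny in-memory CSV, which is an invented stand-in for the real files. Assuming the source files do start with a BOM, passing `encoding='utf-8-sig'` to `pd.read_csv` would be one way to avoid the garbled name.

```python
import io
import pandas as pd

# A tiny CSV that starts with the UTF-8 byte-order mark (EF BB BF)
raw = b"\xef\xbb\xbfpage_number,word\n3,fiscal\n"

# Decoding as cp1252 turns the three BOM bytes into visible characters
garbled = pd.read_csv(io.StringIO(raw.decode("cp1252")))

# Decoding as utf-8-sig strips the BOM before the text is parsed
clean = pd.read_csv(io.StringIO(raw.decode("utf-8-sig")))

print(list(garbled.columns))
print(list(clean.columns))
```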

#### Dropping and reordering columns

In [6]:
# Drop the page-number column; the 'ï»¿' prefix in its name is a UTF-8 byte-order mark decoded with the wrong encoding
data = data.drop(columns="ï»¿page_number")
data.head()

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013


#### Before we do any cleaning, let's check some numbers

In [18]:
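# Remove every character that is neither a word character (letter, digit, underscore) nor whitespace
# regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
data['word'] = data['word'].str.replace(r'[^\w\s]', '', regex=True)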
data.head()

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013
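
The effect of the pattern `[^\w\s]` is easier to see on tokens that actually contain punctuation. The sketch below uses made-up tokens, not values from the dataset. Note that `\w` also matches digits and underscores, so those survive, and that a token made only of punctuation becomes an empty string.

```python
import pandas as pd

tokens = pd.Series(["budget,", "f.y.", "---", "tax_rate", "2020", "county's"])

# Strip everything that is not a word character or whitespace
cleaned = tokens.str.replace(r"[^\w\s]", "", regex=True)

print(cleaned.tolist())
```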


In [19]:
# Replace empty strings with NaN (assigning the result back avoids relying on in-place edits of a single column)
data['word'] = data['word'].replace('', np.nan)

# Drop every row whose word is NaN
data = data.dropna(subset=['word'])
data.head()

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013
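
A small sketch of the empty-token removal on invented data. The final `reset_index(drop=True)` is an optional addition, not something the cell above does. It renumbers the rows so the index has no gaps after rows are dropped.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"word": ["fiscal", "", "year", ""],
                   "organization": "Guilford County"})

rows_before = len(df)
df["word"] = df["word"].replace("", np.nan)               # mark empty tokens as missing
df = df.dropna(subset=["word"]).reset_index(drop=True)    # drop them and renumber the rows

print(rows_before, "->", len(df))
```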


### 3- Organizing the Data

We already created a corpus in an earlier step. A corpus is a collection of texts; here it is stored as a single pandas dataframe with one token per row, labeled by organization and fiscal year.

In [20]:
# Let's take a look at our dataframe
data

Unnamed: 0,word,organization,year
0,fiscal,Guilford County,FY2013
1,year,Guilford County,FY2013
2,adopted,Guilford County,FY2013
3,budget,Guilford County,FY2013
4,brenda,Guilford County,FY2013
...,...,...,...
3528440,index,Wake County,FY2020
3528441,fiscal,Wake County,FY2020
3528442,year,Wake County,FY2020
3528443,adopted,Wake County,FY2020
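
Some downstream algorithms expect one text string per document rather than one token per row. As one possible reshaping, and not necessarily what later steps of this project use, the sketch below joins the tokens of each (organization, year) pair back into a single string. The sample frame and the `text` column name are assumptions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "word": ["fiscal", "year", "adopted", "budget"],
    "organization": ["Guilford County"] * 2 + ["Wake County"] * 2,
    "year": ["FY2013", "FY2013", "FY2020", "FY2020"],
})

# One row per (organization, year), with that document's tokens joined by spaces
documents = (
    sample.groupby(["organization", "year"], sort=False)["word"]
          .apply(" ".join)
          .reset_index(name="text")
)

print(documents)
```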


#### Exporting the dataframe to a single clean CSV file

In [21]:
# Export dataframe to csv
data.to_csv(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData\PreprocessedOriginalDataAll.csv", index=False, encoding='utf-8-sig')

In [22]:
# Change the directory so the pickle file is stored in the right place
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData\pickle") 
# Let's pickle it for later use
data.to_pickle("Data.pkl")
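
To load the pickled dataframe in a later notebook, `pd.read_pickle` restores it with its dtypes and index intact. The sketch below shows a round trip on a small made-up frame in a temporary directory. Building the path with `pathlib` instead of calling `os.chdir` is a design choice of this example, not of the original notebook.

```python
import tempfile
from pathlib import Path
import pandas as pd

sample = pd.DataFrame({"word": ["fiscal", "year"], "year": ["FY2013", "FY2013"]})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = Path(tmp) / "Data.pkl"
    sample.to_pickle(pickle_path)            # write the dataframe to disk
    restored = pd.read_pickle(pickle_path)   # read it back

print(restored.equals(sample))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.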