### Data Pre-Processing


 **Collecting the data:** The data consists of FY2020 budget documents, in PDF form, obtained from the following organizations:

   * [Guilford County](https://www.guilfordcountync.gov/home/showdocument?id=9497)
   * [Durham County](https://www.dconc.gov/home/showdocument?id=27985)
   * [City of Durham](https://durhamnc.gov/DocumentCenter/View/27412/FY20-Final-Budget)
   * [City of Charlotte](https://charlottenc.gov/budget/FY2020%20Documents/FY%202020%20Adopted%20Budget%20Book%207-31%20Complete.pdf)
   * [Mecklenburg County](https://www.mecknc.gov/CountyManagersOffice/OMB/Documents/FY2020%20Adopted%20Budget.pdf)
   * [Wake County](http://www.wakegov.com/budget/fy20/Documents/FY20%20Adopted%20Budget%20Book.pdf)
   * [City of Raleigh](https://user-2081353526.cld.bz/FY2020AdoptedBudget)
   
After the PDF files are collected, they are compressed to reduce their size. The files are then converted to CSV using [an app](https://jason-jones.shinyapps.io/Emotionizer/) developed by the project mentor, **[Jason Jones](https://www.linkedin.com/in/jones-jason-adam/)**.

In [10]:
#Importing packages
import os
import glob
import nltk
import pandas as pd
import numpy as np

In [11]:
# change the current directory to read the data
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\FY2020\structured\original") 

#### Reading and labeling data for all organizations

In [12]:
# 1- Reading Guilford-County data file 
GC_df = pd.read_csv("GuilfordCountyOriginalDataFY20.csv", engine='python')
# inserting "organization" column with static value 
# corresponding to the organization in question 
GC_df.insert(2, "organization", "Guilford County")


# 2- For Charlotte-City data
CC_df = pd.read_csv(r'CharlotteCityOriginalDataFY20.csv', engine='python')
CC_df.insert(2, "organization", "Charlotte City")

# 3- For Durham-City data
DCity_df = pd.read_csv(r'DurhamCityOriginalDataFY20.csv', engine='python')
DCity_df.insert(2, "organization", "Durham City")

# 4- For Durham-County data
DCounty_df = pd.read_csv(r'DurhamCountyOriginalDataFY20.csv', engine='python')
DCounty_df.insert(2, "organization", "Durham County")

# 5- For Mecklenburg-County data
MC_df = pd.read_csv(r'MecklenburgCountyOriginalDataFY20.csv', engine='python')
MC_df.insert(2, "organization", "Mecklenburg County")

# 6- For Raleigh-City data
RC_df = pd.read_csv(r'RaleighCityOriginalDataFY20.csv', engine='python')
RC_df.insert(2, "organization", "Raleigh City")

# 7- For Wake-County data
WC_df = pd.read_csv(r'WakeCountyOriginalDataFY20.csv', engine='python')
WC_df.insert(2, "organization", "Wake County")
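
The seven read-and-label steps above repeat the same two calls, so they could also be written as a loop. The following is only a sketch: `FILES` and `load_labeled` are hypothetical names introduced here, and it assumes every file has the same layout as the Guilford County one.

```python
from pathlib import Path

import pandas as pd

# File name -> organization label, copied from the cell above.
FILES = {
    "GuilfordCountyOriginalDataFY20.csv": "Guilford County",
    "CharlotteCityOriginalDataFY20.csv": "Charlotte City",
    "DurhamCityOriginalDataFY20.csv": "Durham City",
    "DurhamCountyOriginalDataFY20.csv": "Durham County",
    "MecklenburgCountyOriginalDataFY20.csv": "Mecklenburg County",
    "RaleighCityOriginalDataFY20.csv": "Raleigh City",
    "WakeCountyOriginalDataFY20.csv": "Wake County",
}


def load_labeled(data_dir, files=FILES):
    """Read each CSV in `files` from `data_dir` and tag it with its organization.

    Hypothetical helper, not part of the original notebook.
    """
    frames = []
    for filename, organization in files.items():
        frame = pd.read_csv(Path(data_dir) / filename, engine="python")
        # Same call as in the cell above: the label goes in position 2.
        frame.insert(2, "organization", organization)
        frames.append(frame)
    return frames
```

Calling `load_labeled(".")` after the `os.chdir` step would return the seven frames in the same order that the notebook later concatenates them.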


In [13]:
# Combine all dataframes into a single dataframe using the concat() function
# Row labels are renumbered automatically by passing ignore_index=True
df = pd.concat([GC_df, CC_df, DCity_df,
                DCounty_df, MC_df, RC_df, WC_df], ignore_index=True)
df.head()

|   | Unnamed: 0 | page_number | organization | word |
|---|---|---|---|---|
| 0 | 1 | 2 | Guilford County | guilford |
| 1 | 2 | 2 | Guilford County | county |
| 2 | 3 | 2 | Guilford County | by |
| 3 | 4 | 2 | Guilford County | the |
| 4 | 5 | 2 | Guilford County | numbers |
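
To see what `ignore_index=True` changes in the concat call, here is a small illustration on two made-up frames (the rows are toy data, not taken from the budget files). Without the flag, each frame keeps its own row labels, so labels repeat in the combined frame.

```python
import pandas as pd

# Two toy frames standing in for two organizations (made-up rows).
a = pd.DataFrame({"page_number": [2, 2], "word": ["guilford", "county"]})
b = pd.DataFrame({"page_number": [1, 1], "word": ["city", "budget"]})

kept = pd.concat([a, b])                            # row labels repeat
renumbered = pd.concat([a, b], ignore_index=True)   # row labels are renumbered

print(list(kept.index))
print(list(renumbered.index))
```

Unique row labels matter later on, because label-based selection such as `df.loc[0]` would otherwise return one row per source file.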


In [14]:
# listing columns in data frame 
list(df)

['Unnamed: 0', 'page_number', 'organization', 'word']
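
The `Unnamed: 0` column is most likely a row index that was written out when the CSV files were created; that is an assumption, since the notebook does not say. If so, one alternative to dropping it afterwards is to tell `read_csv` to use the first column as the index. A minimal sketch on an in-memory stand-in for one of the files:

```python
import io

import pandas as pd

# A tiny stand-in for one exported file; the leading comma in the header
# is the unnamed column that pandas reports as "Unnamed: 0".
raw = ",page_number,word\n1,2,guilford\n2,2,county\n"

as_column = pd.read_csv(io.StringIO(raw))
as_index = pd.read_csv(io.StringIO(raw), index_col=0)

print(list(as_column.columns))  # still contains the unnamed column
print(list(as_index.columns))   # only the data columns remain
```

With `index_col=0` there is one column fewer, so the position passed to `insert` for the `organization` column would have to be adjusted accordingly.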

#### Dropping and reordering columns

In [15]:
# delete columns using the columns parameter of drop
df = df.drop(columns="Unnamed: 0")

# re-order columns
df = df[['page_number','word','organization']]

df.head()

|   | page_number | word | organization |
|---|---|---|---|
| 0 | 2 | guilford | Guilford County |
| 1 | 2 | county | Guilford County |
| 2 | 2 | by | Guilford County |
| 3 | 2 | the | Guilford County |
| 4 | 2 | numbers | Guilford County |


#### Adding a "year" column with a static value for the fiscal year in question

In [16]:
df.insert(3, "year", "FY2020")
df.head()

|   | page_number | word | organization | year |
|---|---|---|---|---|
| 0 | 2 | guilford | Guilford County | FY2020 |
| 1 | 2 | county | Guilford County | FY2020 |
| 2 | 2 | by | Guilford County | FY2020 |
| 3 | 2 | the | Guilford County | FY2020 |
| 4 | 2 | numbers | Guilford County | FY2020 |
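
Since `head()` only shows rows from the first organization, a quick check before exporting can confirm that every organization made it into the combined frame. The sketch below is one possible way to do that; `summarize` and `EXPECTED_COLUMNS` are hypothetical names, and the toy frame stands in for the real `df`.

```python
import pandas as pd

EXPECTED_COLUMNS = ["page_number", "word", "organization", "year"]


def summarize(frame, expected_orgs):
    """Hypothetical pre-export check: validate columns, return rows per organization."""
    if list(frame.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(frame.columns)}")
    missing = set(expected_orgs) - set(frame["organization"])
    if missing:
        raise ValueError(f"no rows for: {sorted(missing)}")
    return frame["organization"].value_counts()


# Toy rows in the same shape as the combined dataframe.
toy = pd.DataFrame({
    "page_number": [2, 2, 5],
    "word": ["guilford", "county", "budget"],
    "organization": ["Guilford County", "Guilford County", "Wake County"],
    "year": ["FY2020"] * 3,
})
counts = summarize(toy, ["Guilford County", "Wake County"])
print(counts)
```

On the real data, the second argument would be the seven organization labels used when the files were read.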


#### Writing the dataframe to a single, clean CSV file

In [17]:
# Change the directory so the file is stored in the right place
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData") 

# Export dataframe to csv
df.to_csv(r'DataFY20.csv', index=False, encoding='utf-8-sig')

In [18]:
# Change the directory so the pickle file is stored in the right place
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData\pickle") 
# Let's pickle it for later use
df.to_pickle("DataFY20.pkl")
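
The two export cells rely on `os.chdir` and absolute paths that exist only on one machine. As a hedged alternative, the output paths can be built with `pathlib` and passed directly to `to_csv` and `to_pickle`. In this sketch `save_outputs` is a hypothetical helper, and a temporary folder with toy rows is used so that it runs anywhere; the round trip through `read_pickle` checks that the saved frame matches the original.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_outputs(frame, out_dir, stem="DataFY20"):
    """Write `frame` as CSV and as pickle under `out_dir` (hypothetical helper)."""
    out_dir = Path(out_dir)
    pickle_dir = out_dir / "pickle"
    pickle_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv"
    pkl_path = pickle_dir / f"{stem}.pkl"
    # Same options as the export cell above.
    frame.to_csv(csv_path, index=False, encoding="utf-8-sig")
    frame.to_pickle(pkl_path)
    return csv_path, pkl_path


# Toy rows and a temporary folder stand in for the real data and paths.
toy = pd.DataFrame({
    "page_number": [2, 2],
    "word": ["guilford", "county"],
    "organization": ["Guilford County", "Guilford County"],
    "year": ["FY2020", "FY2020"],
})
with tempfile.TemporaryDirectory() as tmp:
    csv_path, pkl_path = save_outputs(toy, tmp)
    restored = pd.read_pickle(pkl_path)

print(restored.equals(toy))
```

Keeping the working directory unchanged also means later cells do not depend on which export cell ran last.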