### Data Pre-Processing


1. **Obtaining the data -** the data consists of budget documents in PDF form, obtained from the following organizations:

   * [Guilford County](https://www.guilfordcountync.gov/home/showdocument?id=9497)
   * [Durham County](https://www.dconc.gov/home/showdocument?id=27985)
   * [City of Durham](https://durhamnc.gov/DocumentCenter/View/27412/FY20-Final-Budget)
   * [City of Charlotte](https://charlottenc.gov/budget/FY2020%20Documents/FY%202020%20Adopted%20Budget%20Book%207-31%20Complete.pdf)
   * [Mecklenburg County](https://www.mecknc.gov/CountyManagersOffice/OMB/Documents/FY2020%20Adopted%20Budget.pdf)
   * [Wake County](http://www.wakegov.com/budget/fy20/Documents/FY20%20Adopted%20Budget%20Book.pdf)
   * [City of Raleigh](https://user-2081353526.cld.bz/FY2020AdoptedBudget)
   
   After the PDF files are collected, they are compressed to reduce their size, then tokenized and converted to CSV files using an [app](https://jason-jones.shinyapps.io/Emotionizer/) developed by the project mentor, **[Jason Jones](https://www.linkedin.com/in/jones-jason-adam/)**.
       
2. **Cleaning the data -** applying common text pre-processing techniques (dropping unused columns, lowercasing, lemmatization, and stop-word removal)


3. **Organizing the data -** organizing the cleaned data into a format that is easy to feed into other algorithms

### FY2020 Data Preprocessing Starts Here

In [3]:
#Importing packages
import os
import glob
import nltk
import pandas as pd
import numpy as np

AttributeError: module 'numpy' has no attribute 'testing'

In [None]:
# change the current directory to read the data
os.chdir(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\FY2020\structured\original") 
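
As an alternative to changing the working directory, the data folder can be built as a path object and joined with each file name. The sketch below is only an illustration: `REPO_ROOT` is a hypothetical placeholder for wherever the repository is cloned, and the folder layout is copied from the path in the cell above.

```python
from pathlib import Path

# Hypothetical location of the cloned repository; adjust to your machine.
REPO_ROOT = Path("Budget_Text_Analysis")

# Folder layout taken from the os.chdir() call above.
DATA_DIR = REPO_ROOT / "util" / "data" / "FY2020" / "structured" / "original"

# Full path to one of the input files, without touching the working directory.
guilford_csv = DATA_DIR / "GuilfordCountyOriginalDataFY20.csv"
print(guilford_csv)
```

A path built this way can be passed directly to `pd.read_csv`, so the notebook behaves the same no matter which directory it is started from.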

### 1- Obtaining the data

#### Reading and labeling data for all organizations

In [None]:
# 1- Reading Guilford-County data file 
GC_df = pd.read_csv("GuilfordCountyOriginalDataFY20.csv", engine='python')
# inserting "organization" column with static value 
# corresponding to the organization in question 
GC_df.insert(2, "organization", "Guilford County")


# 2- For Charlotte-City data
CC_df = pd.read_csv(r'CharlotteCityOriginalDataFY20.csv', engine='python')
CC_df.insert(2, "organization", "Charlotte City")

# 3- For Durham-City data
DCity_df = pd.read_csv(r'DurhamCityOriginalDataFY20.csv', engine='python')
DCity_df.insert(2, "organization", "Durham City")

# 4- For Durham-County data
DCounty_df = pd.read_csv(r'DurhamCountyOriginalDataFY20.csv', engine='python')
DCounty_df.insert(2, "organization", "Durham County")

# 5- For Mecklenburg-County data
MC_df = pd.read_csv(r'MecklenburgCountyOriginalDataFY20.csv', engine='python')
MC_df.insert(2, "organization", "Mecklenburg County")

# 6- For Raleigh-City data
RC_df = pd.read_csv(r'RaleighCityOriginalDataFY20.csv', engine='python')
RC_df.insert(2, "organization", "Raleigh City")

# 7- For Wake-County data
WC_df = pd.read_csv(r'WakeCountyOriginalDataFY20.csv', engine='python')
WC_df.insert(2, "organization", "Wake County")
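
The seven read-and-label steps above follow the same pattern, so they could also be written as one loop over a mapping from file name to organization. This is a minimal sketch, not the notebook's own code: `SOURCES` and `load_labeled` are hypothetical names, and the demo at the bottom runs on a tiny made-up CSV in a temporary folder rather than on the real budget files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# File name -> organization label, copied from the cell above.
SOURCES = {
    "GuilfordCountyOriginalDataFY20.csv": "Guilford County",
    "CharlotteCityOriginalDataFY20.csv": "Charlotte City",
    "DurhamCityOriginalDataFY20.csv": "Durham City",
    "DurhamCountyOriginalDataFY20.csv": "Durham County",
    "MecklenburgCountyOriginalDataFY20.csv": "Mecklenburg County",
    "RaleighCityOriginalDataFY20.csv": "Raleigh City",
    "WakeCountyOriginalDataFY20.csv": "Wake County",
}


def load_labeled(data_dir, sources):
    """Read each CSV named in `sources` and tag its rows with the organization."""
    frames = []
    for file_name, organization in sources.items():
        df = pd.read_csv(Path(data_dir) / file_name, engine="python")
        df.insert(2, "organization", organization)
        frames.append(df)
    return frames


# Demo on a made-up two-row file (saved with its index, like the real inputs).
with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"page_number": [1, 1], "word": ["budget", "county"]})
    sample.to_csv(Path(tmp) / "demo.csv")
    demo_frames = load_labeled(tmp, {"demo.csv": "Demo County"})

demo_columns = list(demo_frames[0].columns)
print(demo_columns)
```

With the real files, `load_labeled(".", SOURCES)` would return the seven labeled dataframes in the order listed, ready to be passed to `pd.concat`.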


In [None]:
# Combine all data frames into a single data frame using the concat() 
# function in pandas. Row labels are renumbered automatically 
# by passing ignore_index=True
data =  pd.concat([GC_df, CC_df, DCity_df, 
                   DCounty_df, MC_df, RC_df, WC_df], ignore_index=True)
data
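
To see what `ignore_index=True` changes, the following toy example concatenates two small made-up frames with and without it. The frame contents are illustrative only.

```python
import pandas as pd

first = pd.DataFrame({"word": ["budget", "fund"]})
second = pd.DataFrame({"word": ["tax"]})

# Without ignore_index each frame keeps its own row labels, so labels repeat.
kept = pd.concat([first, second])

# With ignore_index=True the combined frame is renumbered from 0.
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

Repeated labels make label-based lookups such as `.loc[0]` return several rows, which is why renumbering is useful when stacking the per-organization frames.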

### 2- Cleaning the Data

In [None]:
# listing columns in data frame 
list(data)

#### Dropping and reordering columns

In [None]:
# delete columns using the columns parameter of drop
data = data.drop(columns="Unnamed: 0")

# re-order columns
data = data[['page_number','word','organization']]

data.head()
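
The drop-and-reorder step can be made a little more defensive. The sketch below, on a made-up frame, checks that the expected columns exist before selecting them and uses `errors="ignore"` so the drop does not fail if an input file was saved without its index column. The `wanted` list name is an assumption made for this example.

```python
import pandas as pd

# Made-up frame with the same columns the combined data has at this point.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "page_number": [1, 1],
    "organization": ["Guilford County", "Guilford County"],
    "word": ["Budget", "Summary"],
})

wanted = ["page_number", "word", "organization"]

# Fail early with a clear message if a required column is absent.
missing = [col for col in wanted if col not in raw.columns]
if missing:
    raise KeyError(f"missing expected columns: {missing}")

# errors="ignore" skips the drop when "Unnamed: 0" is not present.
tidy = raw.drop(columns="Unnamed: 0", errors="ignore")[wanted]
print(list(tidy.columns))
```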

#### Adding "Year" column with a static value corresponding to the year in question

In [None]:
data.insert(3, "year", "FY2020")
data.head()

####  Text normalization:

* ##### Lowercasing

In [None]:
# lowercase every entry in column 'word'; split/join also collapses extra whitespace
data['word'] = data['word'].apply(lambda text: " ".join(token.lower() 
                                                     for token in text.split()))

data.head()
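
The same normalization can be expressed with pandas' vectorized string methods. One practical difference: a missing value (NaN) has no `.split()` method, so the lambda in the cell above would raise on it, whereas the `.str` methods pass missing values through. A minimal sketch on a made-up series:

```python
import pandas as pd

# Made-up entries, including extra whitespace and a missing value.
words = pd.Series(["Budget", "  Fiscal   Year ", None, "COUNTY"])

lowered = (
    words.str.lower()    # lowercase each entry
         .str.split()    # split on runs of whitespace
         .str.join(" ")  # rejoin with single spaces
)
print(lowered.tolist())
```

Whether missing entries should be kept or dropped at this stage is a separate decision that the notebook does not state.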

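* ##### Lemmatization
* The cell below prints every token, which can exceed Jupyter's default output rate limit. If that happens, shut down the kernel and restart the notebook with `jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000`.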

In [None]:
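# the lemmatizer needs the WordNet data; download it the first time
nltk.download('wordnet')

wn = nltk.WordNetLemmatizer()
tokens = data['word']

# lemmatize each entry; the result is kept in `tokens`
# and is not written back to data['word']
tokens = [wn.lemmatize(t) for t in tokens]

print(tokens)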
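
The cell above keeps the lemmas in the `tokens` list, so the `word` column itself is unchanged. If the lemmatized form is meant to appear in the exported file, the result has to be assigned back to the dataframe. The sketch below shows one way to do that; `lemmatize_column` is a hypothetical helper, and a tiny stand-in dictionary replaces WordNet so the example runs without downloading the corpus.

```python
import pandas as pd


def lemmatize_column(frame, column, lemmatize):
    """Return a copy of `frame` with `lemmatize` applied to each entry of `column`."""
    out = frame.copy()
    out[column] = [lemmatize(token) for token in out[column]]
    return out


# With NLTK this would be called as (after nltk.download('wordnet')):
#   wn = nltk.WordNetLemmatizer()
#   data = lemmatize_column(data, 'word', wn.lemmatize)

# Stand-in lemmatizer for illustration only; not a real lemma dictionary.
toy_lemmas = {"budgets": "budget", "taxes": "tax"}
demo = pd.DataFrame({"word": ["budgets", "taxes", "county"]})
demo = lemmatize_column(demo, "word", lambda t: toy_lemmas.get(t, t))
print(demo["word"].tolist())
```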

#### Removing stop words

In [None]:
# Load library
from nltk.corpus import stopwords

# Download the set of stop words the first time
nltk.download('stopwords')

# Load stop words (a set makes the membership test below fast)
stop_words = set(stopwords.words('english'))

data['word'] = data['word'].apply(lambda text: " ".join(token 
                            for token in text.split() if token not in stop_words))
data['word'].head()
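
One side effect of this step is worth checking: when an entry consists only of stop words, the filter leaves an empty string rather than removing the row. The sketch below illustrates this with a small hypothetical stop-word set (the notebook uses NLTK's English list) and shows one possible follow-up, dropping the empty entries. Whether to drop them is an assumption, not something the notebook specifies.

```python
# Hypothetical miniature stop-word set, for illustration only.
STOP_WORDS = frozenset({"the", "of", "and", "a", "to", "in"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that is a stop word."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


entries = ["the", "budget", "of the county"]
cleaned = [remove_stop_words(entry) for entry in entries]

# Entries made up entirely of stop words become empty strings.
kept = [entry for entry in cleaned if entry]

print(cleaned)
print(kept)
```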

### 3- Organizing the Data

A corpus is a collection of texts. The corpus was assembled in an earlier step, when the data from all organizations was combined into a single pandas dataframe; that dataframe now holds the cleaned text.

In [None]:
# Let's take a look at our dataframe
data

#### Exporting the dataframe to a single clean CSV file

In [None]:
# Export dataframe to csv
data.to_csv(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\PreprocessedOriginalData\PreprocessedOriginalDataFY20.csv", index=False, encoding='utf-8-sig')
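
A quick round trip can confirm that the export worked as intended. The sketch below writes a made-up frame to a temporary folder and reads it back; the folder and file names mirror the ones above, but the location is temporary and the data is invented. Because the file is written with `index=False`, no `Unnamed: 0` column appears when it is read again, and the `utf-8-sig` encoding adds a byte-order mark that helps spreadsheet programs such as Excel detect UTF-8.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows with the same columns as the preprocessed data.
demo = pd.DataFrame({
    "page_number": [1, 2],
    "word": ["budget", "county"],
    "organization": ["Guilford County", "Wake County"],
    "year": ["FY2020", "FY2020"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "PreprocessedOriginalData"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = out_dir / "PreprocessedOriginalDataFY20.csv"

    demo.to_csv(out_path, index=False, encoding="utf-8-sig")
    restored = pd.read_csv(out_path, encoding="utf-8-sig")

same_columns = list(restored.columns) == list(demo.columns)
print(same_columns)
```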

### FY2019 Data Preprocessing