### Data Pre-Processing


1. **Obtaining the data -** the data is loaded from the local machine

2. **Cleaning the data -** performing some text pre-processing techniques

3. **Organizing the data -** organizing the cleaned data into a format that is easy to feed into other algorithms

In [1]:
#Importing packages
import os
import glob
import nltk
import pandas as pd
import numpy as np

### 1- Obtaining the data

#### Reading and labeling data for all organizations

In [4]:
# 1- Reading Guilford-County data file 
GC_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\GuilfordCounty_original_data.csv', engine='python')
# inserting "organization" column with static value corresponding to the organization in question 
GC_df.insert(2, "organization", "Guilford County")


# 2- For Charlotte-City data
CC_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\CharlotteCity_original_data.csv', engine='python')
CC_df.insert(2, "organization", "Charlotte City")

# 3- For Durham-City data
DCity_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\DurhamCity_original_data.csv', engine='python')
DCity_df.insert(2, "organization", "Durham City")

# 4- For Durham-County data
DCounty_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\DurhamCounty_original_data.csv', engine='python')
DCounty_df.insert(2, "organization", "Durham County")

# 5- For Mecklenburg-County data
MC_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\MecklenburgCounty_original_data.csv', engine='python')
MC_df.insert(2, "organization", "Mecklenburg County")

# 6- For Raleigh-City data
RC_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\RaleighCity_original_data.csv', engine='python')
RC_df.insert(2, "organization", "Raleigh City")

# 7- For Wake-County data
WC_df = pd.read_csv(r'C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\2019-2020\structured\original\WakeCounty_original_data.csv', engine='python')
WC_df.insert(2, "organization", "Wake County")
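
The seven near-identical load steps above could also be written as a single loop. The following is a minimal sketch, not the notebook's actual code: the helper name `load_org_frames`, its `data_dir` argument, and the `ORGANIZATIONS` mapping are assumptions introduced for illustration. Only the `<Name>_original_data.csv` file-name pattern and the `insert(2, "organization", ...)` call come from the cell above.

```python
from pathlib import Path

import pandas as pd

# Assumed mapping from file-name stem to organization label,
# copied from the file names and labels used in the cell above.
ORGANIZATIONS = {
    "GuilfordCounty": "Guilford County",
    "CharlotteCity": "Charlotte City",
    "DurhamCity": "Durham City",
    "DurhamCounty": "Durham County",
    "MecklenburgCounty": "Mecklenburg County",
    "RaleighCity": "Raleigh City",
    "WakeCounty": "Wake County",
}


def load_org_frames(data_dir, organizations=ORGANIZATIONS):
    """Read one '<stem>_original_data.csv' per organization and label its rows."""
    frames = []
    for stem, label in organizations.items():
        csv_path = Path(data_dir) / f"{stem}_original_data.csv"
        df = pd.read_csv(csv_path, engine="python")
        # Same labeling step as above: a constant "organization" column at position 2.
        df.insert(2, "organization", label)
        frames.append(df)
    return frames
```

Called with the folder that holds the original CSV files, the helper would return a list of data frames that can be passed directly to `pd.concat`.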

In [27]:
# combine all data frames into a single data frame using the concat()
# function in pandas. Row labels are renumbered automatically by passing ignore_index=True
data =  pd.concat([GC_df, CC_df, DCity_df, DCounty_df, MC_df, RC_df, WC_df], ignore_index=True)
data

,Unnamed: 0,page_number,organization,word
0,1,2,Guilford County,guilford
1,2,2,Guilford County,county
2,3,2,Guilford County,by
3,4,2,Guilford County,the
4,5,2,Guilford County,numbers
...,...,...,...,...
638126,122012,498,Wake County,index
638127,122013,498,Wake County,fiscal
638128,122014,498,Wake County,year
638129,122015,498,Wake County,adopted
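
The effect of `ignore_index=True` is easiest to see on a tiny example. The sketch below uses two made-up frames (hypothetical data, not the budget files) to compare the row labels with and without the flag.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the budget files).
first = pd.DataFrame({"word": ["guilford", "county"]})
second = pd.DataFrame({"word": ["wake", "county"]})

# Without ignore_index each frame keeps its own row labels, so labels repeat.
kept = pd.concat([first, second])

# With ignore_index=True the result gets a fresh 0..n-1 index.
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

This is why the last row of the combined frame above carries the index 638129 while its per-file counter in `Unnamed: 0` is 122015.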


### 2- Cleaning the Data

In [28]:
# listing columns in data frame 
list(data)

['Unnamed: 0', 'page_number', 'organization', 'word']

#### Dropping and reordering columns

In [29]:
# delete columns using the columns parameter of drop
data = data.drop(columns="Unnamed: 0")

# re-order columns
data = data[['page_number','word','organization']]

data.head()

,page_number,word,organization
0,2,guilford,Guilford County
1,2,county,Guilford County
2,2,by,Guilford County
3,2,the,Guilford County
4,2,numbers,Guilford County


#### Adding a "year" column with a static value corresponding to the fiscal year in question

In [30]:
data.insert(3, "year", "2019-2020")
data.head()

,page_number,word,organization,year
0,2,guilford,Guilford County,2019-2020
1,2,county,Guilford County,2019-2020
2,2,by,Guilford County,2019-2020
3,2,the,Guilford County,2019-2020
4,2,numbers,Guilford County,2019-2020


#### Lowercasing the words

In [31]:
# lowercase all text in the 'word' and 'organization' columns;
# splitting and re-joining also collapses repeated whitespace
data['word'] = data['word'].apply(lambda x: " ".join(x.lower() for x in x.split()))
data['organization'] = data['organization'].apply(lambda x: " ".join(x.lower() for x in x.split()))
data.head()

,page_number,word,organization,year
0,2,guilford,guilford county,2019-2020
1,2,county,guilford county,2019-2020
2,2,by,guilford county,2019-2020
3,2,the,guilford county,2019-2020
4,2,numbers,guilford county,2019-2020
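
The same normalization can be expressed with pandas' vectorized string methods, which also leave missing values untouched instead of raising an error. This is a minimal sketch under the assumption that the goal is lowercasing plus whitespace collapsing; the helper name `normalize_text` and the sample values are illustrative only.

```python
import pandas as pd


def normalize_text(column: pd.Series) -> pd.Series:
    """Lowercase each cell and collapse runs of whitespace to single spaces."""
    return (
        column.str.lower()
        .str.split()      # split on any whitespace; missing values stay missing
        .str.join(" ")    # re-join the tokens with single spaces
    )


# Hypothetical sample, including irregular spacing and a missing value.
sample = pd.Series(["Guilford  County", " Wake County ", None])
normalized = normalize_text(sample)
```

With the notebook's frame, this could be applied as `data['word'] = normalize_text(data['word'])`, and likewise for `organization`.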


#### Lemmatization

In [32]:
# TODO: lemmatize the 'word' column of the dataframe
# (not implemented yet, so the output below is unchanged)

data.head()

,page_number,word,organization,year
0,2,guilford,guilford county,2019-2020
1,2,county,guilford county,2019-2020
2,2,by,guilford county,2019-2020
3,2,the,guilford county,2019-2020
4,2,numbers,guilford county,2019-2020
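
The lemmatization cell above is still a placeholder. One possible shape for it is sketched below, under stated assumptions: the helper `lemmatize_column` and the toy dictionary lemmatizer are hypothetical and exist only so the example runs without downloading a corpus. They are not the notebook's method.

```python
import pandas as pd


def lemmatize_column(words: pd.Series, lemmatize) -> pd.Series:
    """Apply `lemmatize` (a str -> str callable) to every token of every cell."""
    return words.fillna("").apply(
        lambda cell: " ".join(lemmatize(token) for token in cell.split())
    )


# Toy stand-in lemmatizer (an assumption for illustration, not a real lexicon).
TOY_LEMMAS = {"numbers": "number", "counties": "county"}


def toy_lemmatize(token: str) -> str:
    # Return the known lemma, or the token itself when it is not in the toy table.
    return TOY_LEMMAS.get(token, token)


sample = pd.Series(["guilford", "counties", "numbers"])
lemmatized = lemmatize_column(sample, toy_lemmatize)
print(lemmatized.tolist())  # ['guilford', 'county', 'number']
```

With NLTK, one option would be to pass `WordNetLemmatizer().lemmatize` from `nltk.stem` as the callable after running `nltk.download('wordnet')`. By default that method treats every token as a noun, so verbs and adjectives may pass through unchanged.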


#### Removing stop words

In [33]:
# Load library
from nltk.corpus import stopwords

# Download the set of stop words the first time
nltk.download('stopwords')

# Load stop words (as a set, for fast membership tests)
stop_words = set(stopwords.words('english'))

data['word'] = data['word'].apply(lambda x: " ".join(x for x in x.split() if x not in stop_words))
data['word'].head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    guilford
1      county
2            
3            
4     numbers
Name: word, dtype: object
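
As the output shows, removing stop words this way leaves empty strings behind (rows 2 and 3). If those empty rows are not wanted downstream, one alternative is to filter the rows out instead of blanking the cell. The sketch below assumes each cell holds a single token, as the outputs above suggest, and uses a two-word stand-in for NLTK's English stop-word list.

```python
import pandas as pd

# Hypothetical miniature stop-word set; the notebook uses NLTK's English list.
stop_words = {"by", "the"}

sample = pd.DataFrame(
    {
        "page_number": [2, 2, 2, 2, 2],
        "word": ["guilford", "county", "by", "the", "numbers"],
    }
)

# Keep only rows whose word is not a stop word, then renumber the rows.
filtered = sample[~sample["word"].isin(stop_words)].reset_index(drop=True)
print(filtered["word"].tolist())  # ['guilford', 'county', 'numbers']
```

Whether to drop or keep the emptied rows depends on the later analysis: dropping them shrinks the frame, while keeping them preserves the original word positions on each page.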

### 3- Organizing the Data

The corpus was already created in an earlier step. A corpus is a collection of texts; here, all of the texts are gathered in a single pandas dataframe, one word per row.

In [34]:
# Let's take a look at our dataframe
data

,page_number,word,organization,year
0,2,guilford,guilford county,2019-2020
1,2,county,guilford county,2019-2020
2,2,,guilford county,2019-2020
3,2,,guilford county,2019-2020
4,2,numbers,guilford county,2019-2020
...,...,...,...,...
638126,498,index,wake county,2019-2020
638127,498,fiscal,wake county,2019-2020
638128,498,year,wake county,2019-2020
638129,498,adopted,wake county,2019-2020


#### Exporting the dataframe to a single, clean CSV file

In [None]:
# Export dataframe to csv
data.to_csv(r"C:\Users\Sultan\Documents\GitHub\Budget_Text_Analysis\util\data\Preprocessed_Data\PreprocessedDataFY20.csv", index=False, encoding='utf-8-sig')
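
One thing to watch when this file is read back: by default `pd.read_csv` parses empty fields, and tokens such as `null` or `nan`, as missing values. The round trip below is a small sketch on made-up data (the frame and the file name are hypothetical) showing how `keep_default_na=False` keeps the written text as-is.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame with an emptied stop-word cell and a token that looks like a missing value.
frame = pd.DataFrame({"page_number": [2, 2, 3], "word": ["guilford", "", "null"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessed_sample.csv"  # hypothetical file name
    frame.to_csv(path, index=False, encoding="utf-8-sig")

    # Default parsing turns "" and "null" into NaN.
    default_read = pd.read_csv(path, encoding="utf-8-sig")
    # keep_default_na=False keeps every field as the literal text that was written.
    literal_read = pd.read_csv(path, encoding="utf-8-sig", keep_default_na=False)

print(default_read["word"].isna().tolist())  # [False, True, True]
print(literal_read["word"].tolist())         # ['guilford', '', 'null']
```

This matters for the cleaning steps above, because the lambda-based lowercasing and stop-word removal call `.split()` on every cell and would fail on a missing value.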

### Preprocessing FY19 Data

#### Tokenization for FY19 budget documents

In [3]:
# Here we compress the budget documents to get them ready for tokenization.
# Compression uses the pylovepdf tool created by Andrea Bruschi. For reference see https://pypi.org/project/pylovepdf/

from pylovepdf.tools.compress import Compress

t = Compress('public_key', verify_ssl=True)


NameError: name 'Compress' is not defined