## Student Name: Harsha Vardhan
## Student Email: Donkada.H.Vardhan-1@ou.edu

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to learn more about the themes and concepts of existing smart cities. You also want to know where your smart city places among the others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, to examine data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity), find facts about the data, and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with $$$ are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | summary | keywords|
| -- | -- | -- | -- | -- | -- | -- |
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| test | test |

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicant's PDFs as a dataset.
The dataset is from the U.S Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?

Let's load the data!

## Loading and Handling files

Load data from `smartcity/`. 

To extract the data from the PDF files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It will allow you to extract pages from PDF files and add them to a data structure (dataframe, list, dictionary, etc.).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

In [None]:
import os

import pandas as pd
import pypdf

pdf_path = "smartcity"
city_name = []
raw_text = []
for filename in os.listdir(pdf_path):
    if not filename.endswith('.pdf'):
        continue
    # The city label is the part of the file name before the first underscore.
    base_name = os.path.splitext(filename)[0]
    city = base_name.split('_')[0]
    # Open in binary mode; the with-block closes the file once all pages are read.
    with open(os.path.join(pdf_path, filename), 'rb') as pdf_file:
        pdf_reader = pypdf.PdfReader(pdf_file)
        # One entry per page: store the page text alongside its city label.
        for page in pdf_reader.pages:
            raw_text.append(page.extract_text())
            city_name.append(city)

Create a data structure to add the city name and raw text. You can choose to split the city name from the file.
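
One way to split the city name from the file name, and to turn the `OK Norman` form into `Norman, OK`, is sketched below. It assumes file names follow the pattern `<STATE> <City>_<anything>.pdf`; the example file name and the helper `city_from_filename` are hypothetical, not part of the dataset description.

```python
import os


def city_from_filename(filename):
    """Turn e.g. 'OK Norman_application.pdf' into 'Norman, OK'.

    Assumes the (hypothetical) naming pattern '<STATE> <City>_<anything>.pdf'.
    """
    base_name = os.path.splitext(os.path.basename(filename))[0]
    label = base_name.split('_')[0].strip()   # e.g. 'OK Norman'
    state, _, city = label.partition(' ')     # split on the first space only
    if len(state) == 2 and state.isupper() and city:
        return f"{city}, {state}"
    return label  # fall back to the raw label when the pattern does not match


print(city_from_filename("smartcity/OK Norman_application.pdf"))
```

Either form is accepted in the output file, so this step is optional; it mainly helps keep the city column consistent.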

In [None]:
df = pd.DataFrame({'City': city_name, 'raw_text': raw_text})

## Cleaning Up PDFs

One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, and common words (smart, city, page, etc.). Keep in mind that n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data to remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
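
A rough way to spot documents whose text was not extracted properly is to flag text that is very short or mostly non-alphabetic. The sketch below is only a heuristic; the function name `looks_badly_extracted` and both thresholds are assumptions to tune against the actual data.

```python
def looks_badly_extracted(text, min_words=50, min_alpha_ratio=0.6):
    """Heuristic: flag text that is very short or mostly non-alphabetic.

    Both thresholds are illustrative assumptions, not values from the project spec.
    """
    words = text.split()
    if len(words) < min_words:
        return True
    alpha_words = sum(1 for w in words if w.isalpha())
    return alpha_words / len(words) < min_alpha_ratio


good = "transit data sensor " * 30   # 90 alphabetic words
bad = "3 . . 7 | | 12 %% " * 30      # table and image residue, no real words
print(looks_badly_extracted(good), looks_badly_extracted(bad))
```

Flagged documents are candidates for manual inspection rather than automatic removal, since at most 15 cities may be dropped.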


In [None]:
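import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# word_tokenize needs the punkt tokenizer models (named punkt_tab in newer NLTK
# releases) and WordNetLemmatizer needs the wordnet corpus.
for resource in ('stopwords', 'punkt', 'punkt_tab', 'wordnet'):
    nltk.download(resource, quiet=True)

# Standard English stop words plus frequent, uninformative domain terms.
sw = set(stopwords.words('english')).union(['city', 'cities', 'state', 'states', 'page', 'smart'])
lemmatizer = WordNetLemmatizer()
cleaned_text = []
for text in df['raw_text']:
    text = re.sub(r'[^\w\s]', '', text)  # strip punctuation
    words = word_tokenize(text.lower())
    # Drop stop words, then reduce each remaining word to its lemma.
    words = [lemmatizer.lemmatize(word) for word in words if word not in sw]
    cleaned_text.append(' '.join(words))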

#### Add the cleaned text to the structure you created.


In [None]:
df['Cleaned_Text'] = cleaned_text
cities_to_remove = ['OH Toledo', 'CA Moreno Valley', 'TX Lubbock', 'NV Reno', 'FL Tallahassee', 
                    'NY Mt Vernon Yonkers New Rochelle', 'VA Newport News', 'TN Nashville', 'OK Oklahoma']
df = df[~df['City'].isin(cities_to_remove)]
df = df[df['Cleaned_Text'] != '']
df.reset_index(drop=True, inplace=True)

### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

I removed 9 cities because of the images and tables in their documents: OH Toledo, CA Moreno Valley, NY Mt Vernon Yonkers New Rochelle, TN Nashville, VA Newport News, TX Lubbock, NV Reno, FL Tallahassee, and OK Oklahoma.

#### Explain what additional text processing methods you used and why.

#### Did you identify any potentially problematic words?

No

## Experimenting with Clustering Models

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |
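
As an illustration of what one `[S,CH,DB]` cell holds, the sketch below computes the three scores for a single clustering. The toy data and the helper name `score_triplet` are assumptions standing in for the vectorized documents.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)


def score_triplet(X, labels):
    """Return [S, CH, DB] for a dense feature matrix X and cluster labels."""
    return [float(silhouette_score(X, labels)),
            float(calinski_harabasz_score(X, labels)),
            float(davies_bouldin_score(X, labels))]


# Toy data standing in for the document vectors: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
print(score_triplet(X, labels))
```

Higher silhouette and Calinski-Harabasz scores indicate better-separated clusters, while a lower Davies-Bouldin score is better.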



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.
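
One possible approach, sketched here under the assumption that the silhouette score is the selection criterion, is to fit the model for every candidate k and keep the k with the highest score. The helper `best_k_by_silhouette` and the toy data are illustrative only; the elbow method or the other two metrics are equally valid choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_min=2, k_max=50):
    """Return (best_k, best_score), trying only k smaller than the sample count."""
    best_k, best_score = None, -1.0
    for k in range(k_min, min(k_max, X.shape[0] - 1) + 1):
        labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy data with three tight, well-separated groups.
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 0), (0, 10)]
X = np.vstack([rng.normal(c, 0.2, (15, 2)) for c in centers])
best_k, best_score = best_k_by_silhouette(X, k_max=10)
print(best_k, round(best_score, 3))
```

The upper bound is capped at one less than the number of samples because the silhouette score is undefined when every sample is its own cluster.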


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|--|--|--|--|
|Hierarchical |--|--|--|--|
|DBSCAN | X | X | X | -- |



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import numpy as np

k_values = [9, 18, 36]
k_range = range(2, 51)  # candidate k values: 2 to 50 inclusive
kmeans_val = []
hierarchical_val = []
optimal_values = []


def cluster_scores(sparse_x, dense_x, labels):
    """Return [silhouette, Calinski-Harabasz, Davies-Bouldin] for one labelling."""
    return [silhouette_score(sparse_x, labels),
            calinski_harabasz_score(dense_x, labels),
            davies_bouldin_score(dense_x, labels)]


# Each row of df is one page, so the pages of a city are clustered together.
for city in df['City'].unique():
    city_df = df[df['City'] == city]
    vectorized_text = TfidfVectorizer().fit_transform(city_df['Cleaned_Text'])
    dense_text = vectorized_text.toarray()
    n_samples = vectorized_text.shape[0]

    # Scores at the fixed k values from the table; zeros when the city has too few pages.
    for k in k_values:
        km_scores, hc_scores = [0, 0, 0], [0, 0, 0]
        if n_samples > k:  # silhouette_score needs fewer clusters than samples
            km_labels = KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(vectorized_text)
            hc_labels = AgglomerativeClustering(n_clusters=k).fit_predict(dense_text)
            km_scores = cluster_scores(vectorized_text, dense_text, km_labels)
            hc_scores = cluster_scores(vectorized_text, dense_text, hc_labels)
        kmeans_val.append({'k-value': k, 'City': city, 'K-Means': km_scores})
        hierarchical_val.append({'Hierarchical': hc_scores})

    # Search k = 2..50 and keep the setting with the highest silhouette score.
    # The best-so-far values are reset for every city.
    best = {name: {'k': 0, 'scores': [-1, -1, -1]}
            for name in ('kmeans', 'hierarchical', 'dbscan')}
    for k in k_range:
        if n_samples <= k:
            break
        candidates = {
            'kmeans': KMeans(n_clusters=k, random_state=0, n_init="auto").fit_predict(vectorized_text),
            'hierarchical': AgglomerativeClustering(n_clusters=k).fit_predict(dense_text),
            # DBSCAN has no k; min_samples is varied instead, with eps fixed at 0.5.
            'dbscan': DBSCAN(eps=0.5, min_samples=k).fit_predict(vectorized_text),
        }
        for name, labels in candidates.items():
            if len(np.unique(labels)) < 2:  # e.g. DBSCAN marked every page as noise
                continue
            scores = cluster_scores(vectorized_text, dense_text, labels)
            if scores[0] > best[name]['scores'][0]:
                best[name] = {'k': k, 'scores': scores}

    optimal_values.append({
        'City': city,
        'optimal-k_means': best['kmeans']['k'],
        'optimal_k_hierarchical': best['hierarchical']['k'],
        'optimal_k_dbscan': best['dbscan']['k'],
        'k_means_optimal_score': best['kmeans']['scores'],
        'Hierarchical_optimal_score': best['hierarchical']['scores'],
        'DBSCAN_optimal_score': best['dbscan']['scores'],
    })

k_means_df = pd.DataFrame(kmeans_val)
heirarchical_df = pd.DataFrame(hierarchical_val)

result_df = pd.concat([k_means_df, heirarchical_df], axis=1)
print(result_df)
print(optimal_values)

In [None]:
# For each city, keep the optimal k of whichever algorithm has the highest
# silhouette score (ties favor K-means, then Hierarchical, then DBSCAN).
# Note that the value stored under 'clusterid' is that optimal k.
opt_i = []
for d in optimal_values:
    candidates = [
        (d['k_means_optimal_score'][0], d['optimal-k_means']),
        (d['Hierarchical_optimal_score'][0], d['optimal_k_hierarchical']),
        (d['DBSCAN_optimal_score'][0], d['optimal_k_dbscan']),
    ]
    # max() returns the first maximal entry, which preserves the tie order above.
    _, opt = max(candidates, key=lambda pair: pair[0])
    opt_i.append({'City': d['City'], 'clusterid': opt})

print(opt_i)

#### How did you approach finding the optimal k?

To determine the optimal k, I iterated over k values from 2 to 50 and computed the silhouette score for each one. The k value with the highest silhouette score was taken as the optimal k.

#### What algorithm do you believe is the best? Why?

I believe DBSCAN is the best, because in addition to k values it also takes different epsilon values into account when finding the optimal k.

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city respectively. Show the code to append the clusterid below.

In [None]:
clusterid_df = pd.DataFrame(opt_i)
df = pd.merge(df, clusterid_df, on="City")
df

### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to persist the model using one of the methods listed in the link.
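
A minimal sketch of persisting a fitted model with `pickle` is shown below. The helpers `save_model` and `load_model`, the toy data, and the temporary directory are assumptions for illustration; in the notebook the path would simply be `model.pkl` next to the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans


def save_model(model, path="model.pkl"):
    """Serialize a fitted model to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path="model.pkl"):
    """Load a previously saved model."""
    with open(path, "rb") as f:
        return pickle.load(f)


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# A temporary directory keeps this demo from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    save_model(model, path)
    restored = load_model(path)

print((restored.predict(X) == model.predict(X)).all())
```

When the model is applied to new text, the fitted vectorizer is needed as well, so it may be worth saving it alongside the model. Only unpickle files from a trusted source.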

## Deriving Themes and Concepts

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

### Extract themes
Write a theme for each topic (at least a sentence each).

[Your Answer]

[Your Answer]

[Your Answer]

### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.
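
Given a document-topic weight matrix (one row per city, one column per topic), the two strongest topics per row can be picked with `numpy.argsort`. The sketch assumes 1-based ids in the `T1, T2` style of the output table; `top_two_topics` is a hypothetical helper name.

```python
import numpy as np


def top_two_topics(doc_topics):
    """Return a 'T<i>, T<j>' string per row, strongest topic first (1-based ids)."""
    # Sort each row ascending, reverse it, and keep the first two column indices.
    order = np.argsort(doc_topics, axis=1)[:, ::-1][:, :2]
    return [", ".join(f"T{idx + 1}" for idx in row) for row in order]


weights = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.3, 0.6]])
print(top_two_topics(weights))  # ['T1, T2', 'T3, T2']
```

The resulting list can be assigned directly to a `topicids` column when the rows are in the same order as the data structure.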

## Gathering Applicant Summaries and Keywords

For each smart city applicant, gather a summary and keywords that are important to that document. You can use gensim to do this. Here are examples of functions that you could use.

```python

from gensim.summarization import summarize

def summary(text, ratio=0.2, word_count=250, split=False):
    return summarize(text, ratio= ratio, word_count=word_count, split=split)
    
from gensim.summarization import keywords

def keys(text, ratio=0.01):
    return keywords(text, ratio=ratio)
```
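
Note that the `gensim.summarization` module was removed in gensim 4.0, so the functions above need an older gensim release. As a dependency-free stand-in, a simple frequency-based sketch could look like the following. It is a swapped-in technique, not gensim's TextRank, and `simple_keywords` and `simple_summary` are hypothetical helper names.

```python
import re
from collections import Counter


def simple_keywords(text, top_n=5, stop_words=frozenset()):
    """Return the most frequent words longer than two letters."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words and len(t) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def simple_summary(text, n_sentences=2, stop_words=frozenset()):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(t for t in re.findall(r"[a-z]+", text.lower())
                   if t not in stop_words)

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore the original sentence order
    return " ".join(sentences[i] for i in keep)


text = ("Transit data improves transit planning. "
        "Sensors collect transit data. The weather was nice.")
print(simple_keywords(text, top_n=2))
print(simple_summary(text, n_sentences=1))
```

Passing the stop-word set used during cleaning as `stop_words` keeps common terms out of the keywords.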

### Add Summaries and Keywords
Add summary and keywords to output file.

## Write output data

The output data should be written as a TSV file.
You can use `to_csv` method from Pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep = '')` \
`df.to_csv('smartcity_eda.tsv', sep='\t')`
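
To check that the written file has exactly the columns of the output table, a small round trip can help. The sketch below writes to an in-memory buffer instead of `smartcity_eda.tsv`, and it passes `index=False`, which is an assumption made so that no extra index column appears in the file.

```python
import io

import pandas as pd

columns = ["city", "raw text", "clean text", "clusterid",
           "topicids", "summary", "keywords"]
row = ["Norman, OK", "Test, test , and testing.", "test test test",
       0, "T1, T2", "test", "test"]
out = pd.DataFrame([row], columns=columns)

buffer = io.StringIO()  # stands in for 'smartcity_eda.tsv'
out.to_csv(buffer, sep="\t", index=False)

# Read the data back to confirm the columns survive the round trip.
buffer.seek(0)
back = pd.read_csv(buffer, sep="\t")
print(list(back.columns) == columns)
```

Commas inside values such as `Norman, OK` are safe here because the tab is the field separator.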

In [None]:
df.to_csv('smartcity_eda.tsv', sep='\t')

# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
