## Student Name: Fazil Raja
## Student Email: fazilraja11@ou.edu

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to learn more about the themes and concepts of existing smart cities. You also want to know where your smart city ranks among the others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, on data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity) to find facts about the data and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | 
| -- | -- | -- | -- | -- | 
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| 

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicants' PDFs as the dataset.
The dataset is from the U.S. Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?
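
Question 1 can be approached by counting how many documents each word appears in. The sketch below is a minimal illustration, not part of the assignment code: the function names, the `min_share` threshold, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for text in docs.values():
        # set() so a word counts once per document, however often it repeats
        df.update(set(text.lower().split()))
    return df


def frequent_words(docs, min_share=0.8):
    """Return the words present in at least `min_share` of the documents."""
    df = document_frequency(docs)
    cutoff = min_share * len(docs)
    return sorted(word for word, n in df.items() if n >= cutoff)


# Hypothetical toy corpus; real keys would be the PDF file names.
sample_docs = {
    "city_a": "smart city transit data smart",
    "city_b": "smart city parking sensors",
    "city_c": "smart transit corridor",
}
print(frequent_words(sample_docs, min_share=1.0))
```

Words that show up in nearly every application carry little information for clustering, so they are reasonable candidates for a custom stopword list.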

Let's load the data!

## Loading and Handling files (Required)

Load data from `smartcity/`. 

To extract the data from the PDF files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It will allow you to extract pages from PDF files and add them to a data structure (dataframe, list, dictionary, etc).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

In [103]:
import PyPDF2
import os


directory = 'smartcity'
files_data = {}

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfReader(pdfFileObj)
            text = ""
            # concatenate the extracted text of every page
            for page in pdfReader.pages:
                text += page.extract_text() or ""
            # key each document by its file name, e.g. 'VA Norfolk.pdf'
            files_data[filename] = text
            print(filename)


VA Norfolk.pdf
KY Louisville.pdf
MN Minneapolis St Paul.pdf
CA Oceanside.pdf
DC_0.pdf
CA Chula Vista.pdf
FL Jacksonville.pdf
TN Memphis.pdf
IA Des Moines.pdf
OH Toledo.pdf
NJ Newark.pdf
NC Charlotte.pdf
NC Raleigh.pdf
LA New Orleans.pdf
CA Moreno Valley.pdf
MO St. Louis.pdf
IN Indianapolis.pdf
AL Montgomery.pdf
NC Greensboro.pdf
FL St. Petersburg.pdf
CA Riverside.pdf
NE Omaha.pdf
TN Chattanooga.pdf
NY Albany Troy Schenectady Saratoga Springs.pdf
CT NewHaven.pdf
LA Baton Rouge.pdf
RI Providence.pdf
OH Akron.pdf
OK Oklahoma City.pdf
FL Orlando.pdf
MD Baltimore.pdf
VA Virginia Beach.pdf
VA Richmond.pdf
CA Oakland.pdf
FL Tampa.pdf
WA Spokane.pdf
MA Boston.pdf
CA Long Beach.pdf
AZ Scottsdale AZ.pdf


TN Nashville.pdf
NY Mt Vernon Yonkers New Rochelle.pdf
NE Lincoln.pdf
OH Canton.pdf
SC Greenville.pdf
FL Miami.pdf
CA Sacramento.pdf
AZ Tucson.pdf
GA Brookhaven.pdf
NY Buffalo.pdf
LA Shreveport.pdf
TX Lubbock.pdf
NY Rochester.pdf
CA Fremont.pdf
NV Las Vegas.pdf
MI Port Huron and Marysville.pdf
AL Birmingham.pdf
GA Atlanta.pdf
WI Madison.pdf
VA Newport News.pdf
WA Seattle.pdf
OK Tulsa.pdf
NJ Jersey City.pdf
OH Cleveland.pdf
MI Detroit.pdf
CA San Jose_0.pdf
NV Reno.pdf
CA Fresno.pdf
AK Anchorage.pdf
FL Tallahassee.pdf


Create a data structure to hold the city name and raw text. You can choose to split the city name from the file name.

In [4]:
# Added it above
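
Splitting the city name out of the file name can be sketched as follows. This is a minimal example that assumes the `ST City.pdf` naming seen in the listing above; `city_from_filename` is a hypothetical helper, not part of the notebook.

```python
import os
import re


def city_from_filename(filename):
    """Turn 'VA Norfolk.pdf' into 'Norfolk, VA' (assumes 'ST City' naming)."""
    stem = os.path.splitext(filename)[0]
    stem = re.sub(r"_\d+$", "", stem)  # drop suffixes such as '_0'
    state, _, city = stem.partition(" ")
    # files such as 'DC_0.pdf' have no separate city part
    return f"{city}, {state}" if city else state


print(city_from_filename("VA Norfolk.pdf"))
print(city_from_filename("CA San Jose_0.pdf"))
```

File names that do not follow the pattern, such as `AZ Scottsdale AZ.pdf`, would still need manual handling.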


## Cleaning Up PDFs (Required)

One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, and common words (smart, city, page, etc.). Keep in mind that n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data and remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
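
Removing domain-specific terms, including multi-word ones, can be sketched with a single regular expression. The term list and the helper name below are assumptions chosen for illustration; the real list would come from inspecting the corpus.

```python
import re

# Hypothetical term list; tune it against your own corpus.
CUSTOM_TERMS = ["smart city", "smart", "city", "page"]


def remove_custom_terms(text, terms=CUSTOM_TERMS):
    """Delete whole-word matches of each term, trying longer terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b",
        flags=re.IGNORECASE,
    )
    cleaned = pattern.sub(" ", text)
    # collapse the gaps left behind by the removed terms
    return re.sub(r"\s+", " ", cleaned).strip()


print(remove_custom_terms("Smart City page 3 smart transit citywide"))
```

The word boundaries keep longer words such as `citywide` intact, and sorting by length makes sure the bigram `smart city` is matched before its parts.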


In [93]:
import nltk
import spacy
import unicodedata
from contractions import CONTRACTION_MAP
import re
from nltk.corpus import wordnet
import collections
#from textblob import Word
from nltk.tokenize.toktok import ToktokTokenizer
from bs4 import BeautifulSoup

nltk.download('stopwords')

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
nlp = spacy.load('en_core_web_sm', exclude=['parser'])
# nlp_vec = spacy.load('en_vectors_web_lg', parse=True, tag=True, entity=True)

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    if bool(soup.find()):
        [s.extract() for s in soup(['iframe', 'script'])]
        stripped_text = soup.get_text()
        stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    else:
        stripped_text = text
    return stripped_text


#def correct_spellings_textblob(tokens):
#	return [Word(token).correct() for token in tokens]  


def simple_porter_stemming(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text


def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text


def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
            
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens


def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text


def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]|\[|\]' if not remove_digits else r'[^a-zA-Z\s]|\[|\]'
    text = re.sub(pattern, '', text)
    return text


def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

def remove_cities_states(text) :
    labels = ['GPE', 'LOC', 'PERSON', 'ORG']

    cities_states = ['Brookhaven', 'Tallahassee', 'Buffalo', 'Riverside', 'Scottsdale', 'Jacksonville', 'New Orleans', 'Montgomery', 'Port Huron', 'Marysville', 'Seattle', 'Shreveport', 'Spokane', 'Indianapolis', 'Birmingham', 'Baton Rouge', 'Miami', 'Oceanside', 'San Jose', 'Lincoln', 'Boston', 'Sacramento', 'Richmond', 'Atlanta', 'Rochester', 'Memphis', 'Raleigh', 'Albany', 'Troy', 'Schenectady', 'Saratoga Springs', 'Cleveland', 'Charlotte', 'Jersey City', 'Chula Vista', 'Long Beach', 'Detroit', 'Des Moines', 'St. Louis', 'Omaha', 'Akron', 'Newport News', 'Mt Vernon', 'Yonkers', 'New Rochelle', 'Fremont', 'Baltimore', 'Greenville', 'NewHaven', 'Lubbock', 'Fresno', 'Oakland', 'Chattanooga', 'Providence', 'Anchorage', 'Tucson', 'Minneapolis', 'Reno', 'Toledo', 'Greensboro', 'Canton', 'Las Vegas', 'Nashville', 'Oklahoma City', 'Madison', 'Newark', 'Louisville', 'St. Petersburg', 'Moreno Valley', 'Tampa', 'Norfolk', 'Washington, DC', 'Orlando', 'Virginia Beach', 'Tulsa']

    doc = nlp(text)

    for ent in doc.ents:
        if ent.label_ in labels or ent.text in cities_states:
            text = text.replace(ent.text, '')

    return text


def normalize_corpus(corpus,html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_stemming=False, text_lemmatization=True, 
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list, cites_states = True):
    
    normalized_corpus = {}
    # normalize each document in the corpus
    count = 0
    for key in corpus.keys():
        count += 1
        print(count)
        doc = corpus[key]
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        
        # remove extra newlines
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))

        # remove states and cities
        if cites_states:
            doc = remove_cities_states(doc)
            
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)

        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)

        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)

        # stem text
        if text_stemming and not text_lemmatization:
            doc = simple_porter_stemming(doc)

        # remove special characters and\or digits  
        print("remove_special_characters")  
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)

        # lowercase the text
        if text_lower_case:
            doc = doc.lower()

        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case, stopwords=stopwords)

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        doc = doc.strip()
            
        normalized_corpus[key] = doc
    print("---------------------returning---------------------")
    return normalized_corpus

[nltk_data] Downloading package stopwords to /Users/Fazil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Add the cleaned text to the structure you created.


In [94]:
files_normalized_data = normalize_corpus(files_data)



### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

As of right now I have not removed any applicants, as the data looks clean. I did not see any issues with the documents.
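
One way to double-check this is to flag documents whose extracted text is very short or mostly non-alphabetic. The sketch below is only an illustration: `flag_suspect_documents` and both thresholds are hypothetical and would need tuning on the real corpus.

```python
def flag_suspect_documents(docs, min_tokens=200, min_alpha_share=0.6):
    """Return {name: reason} for documents that look badly extracted."""
    suspects = {}
    for name, text in docs.items():
        tokens = text.split()
        if len(tokens) < min_tokens:
            suspects[name] = f"only {len(tokens)} tokens"
            continue
        # share of tokens made up purely of letters
        alpha = sum(token.isalpha() for token in tokens)
        if alpha / len(tokens) < min_alpha_share:
            suspects[name] = "mostly non-alphabetic tokens"
    return suspects


# Hypothetical inputs: an empty file, a normal one, and a numeric-only one.
toy_docs = {
    "empty.pdf": "",
    "normal.pdf": "transit " * 300,
    "numbers.pdf": "12 34 " * 200,
}
print(flag_suspect_documents(toy_docs))
```

Running a check like this on both the raw and the cleaned text gives a short list of applicants to inspect by hand before deciding whether to drop them.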

#### Explain what additional text processing methods you used and why.

I used spaCy named-entity labels (GPE, LOC, PERSON, ORG) to remove cities, states, and other named entities from the text. I also created a list of all the cities in `smartcity/` and used it to remove those names from the text.

#### Did you identify any potentially problematic words?

None for now, as the normalization removed quite a few words.

## Experimenting with Clustering Models (Required)

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [0.0831, 2.8213, 2.1941] | [0.0407, 2.2966, 1.5409] | [0.0847, 2.2808, 1.0074] |  |
|Hierarchical |[0.1024, 3.1921, 2.0224] | [0.0964, 2.7341, 1.4481] | [0.093, 2.3776, 1.0526] | [S,CH,DB]|
|DBSCAN | [0.1337, 1.9308, 2.8541] | [0.1337, 1.9308, 2.8541] | [0.1309, 1.9897, 3.1638] | [S,CH,DB] |



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|--|--|--|36|
|Hierarchical |--|--|--|36|
|DBSCAN | X | X | X | 36 |

In [95]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# create tfidf matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(files_normalized_data.values())

for k in [9, 18, 36]:
    print("k = " + str(k))
    kmeans = KMeans(n_clusters=k, random_state=0).fit(tfidf_matrix)

    # dbscan
    dbscan = DBSCAN(eps=1, min_samples=k).fit_predict(tfidf_matrix)

    # agglomerative
    sparseArray = tfidf_matrix.toarray()
    agglomerative = AgglomerativeClustering(n_clusters=k).fit_predict(sparseArray)

    kmeans_score = silhouette_score(tfidf_matrix, kmeans.labels_)
    dbscan_score = silhouette_score(tfidf_matrix, dbscan)
    agglomerative_score = silhouette_score(tfidf_matrix, agglomerative)

    k_means_davies_bouldin_score = davies_bouldin_score(tfidf_matrix.toarray(), kmeans.labels_)
    dbscan_davies_bouldin_score = davies_bouldin_score(tfidf_matrix.toarray(), dbscan)
    agglomerative_davies_bouldin_score = davies_bouldin_score(tfidf_matrix.toarray(), agglomerative)

    k_means_calinski_harabasz_score = calinski_harabasz_score(tfidf_matrix.toarray(), kmeans.labels_)
    dbscan_calinski_harabasz_score = calinski_harabasz_score(tfidf_matrix.toarray(), dbscan)
    agglomerative_calinski_harabasz_score = calinski_harabasz_score(tfidf_matrix.toarray(), agglomerative)


    print([round(kmeans_score, 4), round(k_means_calinski_harabasz_score, 4), round(k_means_davies_bouldin_score, 4)])
    print([round(dbscan_score, 4), round(dbscan_calinski_harabasz_score, 4), round(dbscan_davies_bouldin_score, 4)])
    print([round(agglomerative_score, 4), round(agglomerative_calinski_harabasz_score, 4), round(agglomerative_davies_bouldin_score, 4)])


k = 9




[0.0831, 2.8213, 2.1941]
[0.1337, 1.9308, 2.8541]
[0.1024, 3.1921, 2.0224]
k = 18




[0.0407, 2.2966, 1.5409]
[0.1337, 1.9308, 2.8541]
[0.0964, 2.7341, 1.4481]
k = 36




[0.0847, 2.2808, 1.0074]
[0.1309, 1.9897, 3.1638]
[0.093, 2.3776, 1.0526]


#### How did you approach finding the optimal k?

To pick the best k, we want to maximize the silhouette and Calinski-Harabasz scores while minimizing the Davies-Bouldin score. I used a for loop to go through each k (9, 18, and 36), printed the three scores, and compared them. The k with the best scores was 36.
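
The comparison can be made explicit by ranking each k on every metric and summing the ranks. This rank-sum rule is an assumption chosen for illustration, one of several reasonable ways to combine the three scores; the numbers are the K-means scores reported earlier in this section.

```python
def rank_k_values(scores):
    """Rank each k by its summed rank across [silhouette, CH, DB] scores.

    Silhouette and Calinski-Harabasz are better when higher;
    Davies-Bouldin is better when lower. The lowest total rank wins.
    """
    ks = list(scores)
    totals = dict.fromkeys(ks, 0)
    # (position in the score list, True if higher is better)
    for position, higher_is_better in [(0, True), (1, True), (2, False)]:
        ordered = sorted(ks, key=lambda k: scores[k][position],
                         reverse=higher_is_better)
        for rank, k in enumerate(ordered):
            totals[k] += rank
    return sorted(totals.items(), key=lambda item: item[1])


# K-means scores from this section: k -> [S, CH, DB]
kmeans_scores = {
    9: [0.0831, 2.8213, 2.1941],
    18: [0.0407, 2.2966, 1.5409],
    36: [0.0847, 2.2808, 1.0074],
}
print(rank_k_values(kmeans_scores))
```

The same function works unchanged on a dictionary that covers every k from 2 to 50, which is the range the assignment asks for.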

#### What algorithm do you believe is the best? Why?

None of them stands out, since none of the three clustering algorithms is designed to extract themes from a corpus. A topic model such as LDA is better suited to extracting themes. If I had to choose one, I would use K-means, as it is simple and efficient on large datasets such as ours.

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city. Show the code that appends the cluster id below.

In [96]:
text_data = list(files_normalized_data.values())
text_vectors = tfidf_vectorizer.fit_transform(text_data)

k  = 36
kmeans = KMeans(n_clusters=k, random_state=0).fit(text_vectors)

cluster_ids = kmeans.predict(text_vectors)
files_cluster_data = files_normalized_data.copy()

for i, city in enumerate(files_normalized_data.keys()):
    files_cluster_data[city] = cluster_ids[i]

print(files_cluster_data)



{'VA Norfolk.pdf': 10, 'KY Louisville.pdf': 10, 'MN Minneapolis St Paul.pdf': 10, 'CA Oceanside.pdf': 30, 'DC_0.pdf': 13, 'CA Chula Vista.pdf': 17, 'FL Jacksonville.pdf': 22, 'TN Memphis.pdf': 33, 'IA Des Moines.pdf': 29, 'OH Toledo.pdf': 1, 'NJ Newark.pdf': 12, 'NC Charlotte.pdf': 10, 'NC Raleigh.pdf': 5, 'LA New Orleans.pdf': 2, 'CA Moreno Valley.pdf': 1, 'MO St. Louis.pdf': 4, 'IN Indianapolis.pdf': 10, 'AL Montgomery.pdf': 25, 'NC Greensboro.pdf': 14, 'FL St. Petersburg.pdf': 20, 'CA Riverside.pdf': 10, 'NE Omaha.pdf': 4, 'TN Chattanooga.pdf': 3, 'NY Albany Troy Schenectady Saratoga Springs.pdf': 4, 'CT NewHaven.pdf': 10, 'LA Baton Rouge.pdf': 4, 'RI Providence.pdf': 33, 'OH Akron.pdf': 18, 'OK Oklahoma City.pdf': 34, 'FL Orlando.pdf': 10, 'MD Baltimore.pdf': 6, 'VA Virginia Beach.pdf': 7, 'VA Richmond.pdf': 11, 'CA Oakland.pdf': 2, 'FL Tampa.pdf': 15, 'WA Spokane.pdf': 31, 'MA Boston.pdf': 2, 'CA Long Beach.pdf': 10, 'AZ Scottsdale AZ.pdf': 21, 'TN Nashville.pdf': 10, 'NY Mt Verno

### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to persist the model using one of the methods listed in the link.
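
The save-and-reload cycle can be sketched with the standard-library `pickle` module. The dictionary below is only a stand-in for a fitted model, and `save_model` / `load_model` are hypothetical helper names; the example writes to a temporary directory rather than next to the notebook.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize `model` to `path`."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a model previously written by save_model (trusted files only)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in object; in the notebook this would be the fitted clustering model.
stand_in_model = {"n_clusters": 36, "random_state": 0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Because a clustering model expects vectors with the same columns it was trained on, `project3.py` would presumably also need the fitted `TfidfVectorizer`, which could be saved the same way.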

In [112]:
from joblib import dump, load
dump(kmeans, 'model.pkl')

pred = kmeans.predict(text_vectors[0])
print(kmeans.cluster_centers_.shape)
print(type(text_data))
print(type(text_vectors[0]))

(36, 20353)
<class 'list'>
<class 'scipy.sparse._csr.csr_matrix'>


## Deriving Themes and Concepts (Required)

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

In [70]:
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

TOTAL_TOPICS = 36

lda_model = LatentDirichletAllocation(n_components=TOTAL_TOPICS, random_state=0)

doc = lda_model.fit(text_vectors)

topic_word = lda_model.components_

vocab = np.array(tfidf_vectorizer.get_feature_names_out())

topic_key_terms_idxs = np.argsort(-np.absolute(topic_word), axis=1)[:, :36]
topic_keyterms = vocab[topic_key_terms_idxs]
topics = [', '.join(topic) for topic in topic_keyterms]
topics_df = pd.DataFrame(topics, columns = ['Terms per Topic'], index=['Topic'+str(t) for t in range(1, TOTAL_TOPICS+1)])
topics_df

| | Terms per Topic |
| -- | -- |
| Topic1 | information, improve, make, region, corridor, ... |
| Topic2 | aa, php, photovoltaic, photography, photograph... |
| Topic3 | hr, oakland, awalk, vaa, seeclickfix, difficul... |
| Topic4 | challengeapplication, traic, synertic, eective... |
| Topic5 | sign, chlenge, page, fiber, emission, projt, l... |
| Topic6 | opportunity, mile, planning, innovation, desig... |
| Topic7 | oks, dash, wegolook, oklahoma, frail, itn, ini... |
| Topic8 | transporta, innova, eff, exis, addi, iden, cri... |
| Topic9 | aa, php, photovoltaic, photography, photograph... |
| Topic10 | aa, php, photovoltaic, photography, photograph... |
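
Selecting the top terms per topic comes down to sorting each topic's word weights. The following is a minimal pure-Python sketch of that step with a hypothetical two-topic weight matrix; `top_terms` is an illustrative helper, not the notebook's code.

```python
def top_terms(topic_word, vocab, n=5):
    """Return the n highest-weighted vocabulary terms for each topic."""
    result = []
    for weights in topic_word:
        # indices of the largest weights, best first
        best = sorted(range(len(weights)),
                      key=lambda j: weights[j], reverse=True)[:n]
        result.append([vocab[j] for j in best])
    return result


# Hypothetical weights: 2 topics over a 4-word vocabulary.
vocab = ["transit", "sensor", "parking", "data"]
topic_word = [
    [0.9, 0.1, 0.3, 0.5],
    [0.2, 0.8, 0.7, 0.1],
]
for number, terms in enumerate(top_terms(topic_word, vocab, n=2), start=1):
    print(f"Topic{number}: {', '.join(terms)}")
```

In the NumPy version used in the cell above, slicing with `[:, :5]` instead of `[:, :36]` would keep five terms per topic, which matches the "top five words" the section asks for.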


### Extract themes
Write a theme for each topic (at least a sentence each).

Topic1: Regional Development and Corridor Improvement
Topic2: Photography and Photovoltaic Technology
Topic3: Human Resources and Civic Engagement in Oakland
Topic4: Effective Traffic Management and Synergetic Applications
Topic5: Signage and Fiber Emission Reduction Projects
Topic6: Urban Planning and Innovative Design Opportunities
Topic7: Frailty and Mobility in Oklahoma: Initiatives and Innovations
Topic8: Efficient and Innovative Transportation Solutions
Topic9: Photography and Photovoltaic Technology
Topic10: Photography and Photovoltaic Technology
Topic11: Photography and Photovoltaic Technology
Topic12: Legally-compliant Transloading and Plugless Technology
Topic13: Community Communication and Funding Opportunities
Topic14: Photography and Photovoltaic Technology
Topic15: Mobility Solutions for Newark and Anchorage
Topic16: Data-driven Transportation Systems and Vehicles
Topic17: Cotton Belt Rail Line and MARTA in English Avenue
Topic18: Development and Implementation of Educational Programs
Topic19: Poverty Reduction, Neighborhood Revitalization, and Transportation Access
Topic20: Federal Initiatives and Community-based Programs
Topic21: Photography and Photovoltaic Technology
Topic22: Clutter Reduction and Tariff Optimization in Transit
Topic23: Trade, Security, and Infrastructure in Jacksonville
Topic24: Pinellas County Technological Advancements and Connectivity
Topic25: Photography and Photovoltaic Technology
Topic26: University Management and Sustainable Maintenance
Topic27: Bike Counts and App-based Transportation Solutions
Topic28: Programmable Simulators for Radical Teaching
Topic29: SDOT and Bi-National Mapping and Planning
Topic30: Title Disclosure and Parking Management
Topic31: Traffic Management Beyond Winston-Salem Challenges
Topic32: River Parkway Development and Beeline Bus Service
Topic33: Photography and Photovoltaic Technology
Topic34: Regional Approaches to Transportation in St. Thomas
Topic35: Local Vision and Downtown Proposals
Topic36: Ferry Projects and Synergistic Progression


### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.

In [86]:
topic_dist = lda_model.transform(text_vectors)

# sort topic distribution for each document
top_topics = []
for i, doc in enumerate(topic_dist):
    sorted_topics = sorted(enumerate(doc), key=lambda x: x[1], reverse=True)
    top_topics.append(sorted_topics)

keys = list(files_normalized_data.keys())
# print top topics for each document with probabilities
for i, doc in enumerate(text_vectors):
    print(f"Document {i}: {keys[i]}")
    print(f"Topic {top_topics[i][0]}: {top_topics[i][1]}")
    print("\n")

Document 0: VA Norfolk.pdf
Topic (15, 0.458828416126366): (0, 0.14817431876993228)


Document 1: KY Louisville.pdf
Topic (15, 0.41276745669931014): (34, 0.33353207488892866)


Document 2: MN Minneapolis St Paul.pdf
Topic (15, 0.4694562157330299): (34, 0.12673912759703665)


Document 3: CA Oceanside.pdf
Topic (15, 0.4193131265597532): (21, 0.11341321129327087)


Document 4: DC_0.pdf
Topic (15, 0.409184239876521): (0, 0.2446824059487731)


Document 5: CA Chula Vista.pdf
Topic (17, 0.644166154033635): (15, 0.1341613656965425)


Document 6: FL Jacksonville.pdf
Topic (15, 0.37311080656869655): (0, 0.1265661346024401)


Document 7: TN Memphis.pdf
Topic (15, 0.4764023948034659): (0, 0.1251536293332473)


Document 8: IA Des Moines.pdf
Topic (15, 0.46445217871114564): (0, 0.16227348316384468)


Document 9: OH Toledo.pdf
Topic (0, 0.027777777777777776): (1, 0.027777777777777776)


Document 10: NJ Newark.pdf
Topic (15, 0.29191162579949353): (14, 0.20391791487669184)


Document 11: NC Charlotte.pd

In [90]:
# add the topic to the dictionary
files_topic_data = files_normalized_data.copy()
for i, doc in enumerate(text_vectors):
    files_topic_data[keys[i]] = top_topics[i][0][0], top_topics[i][1][0]

print(files_topic_data)

{'VA Norfolk.pdf': (15, 0), 'KY Louisville.pdf': (15, 34), 'MN Minneapolis St Paul.pdf': (15, 34), 'CA Oceanside.pdf': (15, 21), 'DC_0.pdf': (15, 0), 'CA Chula Vista.pdf': (17, 15), 'FL Jacksonville.pdf': (15, 0), 'TN Memphis.pdf': (15, 0), 'IA Des Moines.pdf': (15, 0), 'OH Toledo.pdf': (0, 1), 'NJ Newark.pdf': (15, 14), 'NC Charlotte.pdf': (15, 5), 'NC Raleigh.pdf': (15, 0), 'LA New Orleans.pdf': (15, 34), 'CA Moreno Valley.pdf': (0, 1), 'MO St. Louis.pdf': (15, 4), 'IN Indianapolis.pdf': (15, 0), 'AL Montgomery.pdf': (5, 15), 'NC Greensboro.pdf': (15, 0), 'FL St. Petersburg.pdf': (15, 34), 'CA Riverside.pdf': (15, 34), 'NE Omaha.pdf': (15, 0), 'TN Chattanooga.pdf': (15, 26), 'NY Albany Troy Schenectady Saratoga Springs.pdf': (0, 15), 'CT NewHaven.pdf': (15, 0), 'LA Baton Rouge.pdf': (15, 0), 'RI Providence.pdf': (15, 34), 'OH Akron.pdf': (23, 15), 'OK Oklahoma City.pdf': (15, 6), 'FL Orlando.pdf': (15, 34), 'MD Baltimore.pdf': (15, 25), 'VA Virginia Beach.pdf': (15, 23), 'VA Richmond
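
To match the `T1, T2` style shown in the output table at the top of the notebook, the two topic indices can be turned into labels. The sketch below assumes 1-based labels so that they line up with the `Topic1` to `Topic36` names used earlier; `top_two_topics` is a hypothetical helper.

```python
def top_two_topics(distribution):
    """Return labels such as 'T4, T1' for the two most probable topics."""
    ranked = sorted(range(len(distribution)),
                    key=lambda t: distribution[t], reverse=True)
    # +1 so the labels line up with the Topic1..TopicN names
    return ", ".join(f"T{t + 1}" for t in ranked[:2])


# Hypothetical 4-topic distribution for one document.
print(top_two_topics([0.15, 0.05, 0.10, 0.70]))
```

A document whose top probabilities all equal 1/36 (about 0.0278 with 36 topics) has a flat distribution, so its top two topic IDs carry no information; that may be worth checking against the document's cleaned text.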

## Gathering Applicant Summaries and Keywords (Extra Credit Section)

For each smart city applicant, gather a summary and keywords that are important to that document. Gensim is outdated; try a spacy or nltk method.
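
As a starting point, keywords and an extractive summary can both be derived from word frequencies. The sketch below uses only the standard library as a simplified stand-in for a spaCy or NLTK pipeline; the function names, the regex-based sentence splitting, and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter


def word_counts(text, stopwords):
    """Frequency of each lowercase alphabetic word outside `stopwords`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stopwords)


def keywords(text, stopwords=frozenset(), n=5):
    """Return the n most frequent non-stopword terms."""
    return [word for word, _ in word_counts(text, stopwords).most_common(n)]


def summarize(text, stopwords=frozenset(), n_sentences=2):
    """Keep the sentences whose words are most frequent in the document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = word_counts(text, stopwords)

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # emit the chosen sentences in their original order
    return " ".join(s for s in sentences if s in top)


sample = ("Transit data improves transit planning. The weather was nice. "
          "Open data supports transit riders.")
stop = {"the", "was"}
print(keywords(sample, stop, n=2))
print(summarize(sample, stop, n_sentences=2))
```

A library-based version would swap the regular expressions for a proper tokenizer and sentence splitter, but the scoring idea stays the same.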



### Add Summaries and Keywords
Add summary and keywords to output file.

## Write output data (Required)

The output data should be written as a TSV file.
You can use `to_csv` method from Pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep='<delimiter>')` \
`Example: df.to_csv('smartcity_eda.tsv', sep='\t')`

In [102]:
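import csv

# newline="" lets the csv module control line endings
with open("smartcity_eda.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")

    # column names follow the output table at the top of the notebook
    writer.writerow(["city", "raw text", "clean text", "clusterid", "topicids"])

    # one row per applicant, keyed by the PDF file name
    for city in files_data.keys():
        writer.writerow([city, files_data[city], files_normalized_data[city], files_cluster_data[city], files_topic_data[city]])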
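
Reading the file back is a quick way to confirm it was written correctly. The sketch below is a hedged example: `read_tsv` is a hypothetical helper, the single row is the sample row from the output table at the top of the notebook, and the raised field limit is there because full-document text can exceed the `csv` module's default field size.

```python
import csv
import os
import tempfile


def read_tsv(path):
    """Read a TSV file into a list of dicts keyed by the header row."""
    # Full-document text can exceed the csv module's default field limit.
    csv.field_size_limit(10_000_000)
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical single-row file, written only to exercise the reader.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "smartcity_eda.tsv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["city", "raw text", "clean text", "clusterid", "topicids"])
        writer.writerow(["Norman, OK", "Test, test , and testing.",
                         "test test test", 0, "T1, T2"])
    rows = read_tsv(path)

print(rows[0]["city"], rows[0]["topicids"])
```

Every value comes back as a string, so numeric columns such as the cluster id need converting after loading.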

# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
