## Student Name: Chenyi "Crystal" Zhang
## Student Email: cschmidt@ou.edu

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to know more about the themes and concepts of existing smart cities. You also want to know where your smart city places among others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, to examine data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity), find facts about the data, and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | 
| -- | -- | -- | -- | -- | 
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| 

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicants' PDFs as a dataset.
The dataset is from the U.S. Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?

Let's load the data!

## Loading and Handling files (Required)

Load data from `smartcity/`. 

To extract the data from the pdf files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It will allow you to extract pages from pdf files and add them to a data structure (dataframe, list, dictionary, etc).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

In [296]:
import sys
import os


In [297]:
import pandas as pd
import re
from pypdf import PdfReader

In [298]:
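def extract_text(file_path):
    # Concatenate the extracted text of every page in the PDF.
    # The context manager closes the file even if extraction raises.
    with open(file_path, 'rb') as pdf_file:
        pdf_reader = PdfReader(pdf_file)
        text = ''
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text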

def extract_state_city_name(file):
    if "DC" in file:
        return "DC",  "Washington, D.C"
    parts = file.split(' ')
    pdf_extension_remove = parts[-1].split('.')
    pdf_extension_remove.pop(-1)
    state = parts[0]
    del parts[0]
    parts.pop(-1)
    parts = parts + pdf_extension_remove
    sep = ' '
    city_name = sep.join(parts)
    return state, city_name

Create a data structure to add the city name and raw text. You can choose to split the city name from the file.
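
As a rough alternative for splitting the file name, the sketch below assumes every file is named `<STATE> <City>.pdf`, as the file listing suggests. The helper name `parse_state_city` is hypothetical. Unlike the notebook's `extract_state_city_name`, it drops a state abbreviation repeated at the end of the name and does not special-case DC.

```python
import os


def parse_state_city(file_name):
    """Split a name like 'VA Virginia Beach.pdf' into ('VA', 'Virginia Beach')."""
    # Drop the extension, then split on the first space only.
    stem, _ext = os.path.splitext(file_name)
    state, _, city = stem.partition(" ")
    # Some files repeat the state at the end (e.g. 'AZ Scottsdale AZ.pdf'); drop that suffix.
    if city.endswith(" " + state):
        city = city[: -len(state) - 1]
    return state, city.strip()


print(parse_state_city("VA Virginia Beach.pdf"))
print(parse_state_city("AZ Scottsdale AZ.pdf"))
```

Whether the trailing state should be removed is a design choice; keeping it would match the `City` column shown in the notebook output.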

In [299]:
files_path = os.path.join(os.getcwd(), 'smartcity')
pdf_files = [file for file in os.listdir(files_path) if file.endswith('.pdf')]

In [300]:
data = []
for pdf in pdf_files:
    state, city = extract_state_city_name(pdf)
    try:
        raw_text = extract_text(os.path.join(files_path, pdf))
        data.append([state, city, raw_text])
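    except Exception as exc:
        # Report which file failed and why, then continue with the remaining PDFs
        print(f"{pdf}: {exc}")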
df = pd.DataFrame(data, columns = ['State','City', 'Raw Text'])
print(df)

   State            City                                           Raw Text
0     AK       Anchorage  CONTENTS \n1 VISION .............................
1     AL      Birmingham  aBirmingham\nRising\nBirmingham Rising! Meetin...
2     AL      Montgomery   \n \n U.S. Department of Transportation - “BE...
3     AZ   Scottsdale AZ    \n  \n \n \n \nFederal Agency Name:   U.S. D...
4     AZ          Tucson  Tucson Smart City Demonstration Proposal\nPart...
..   ...             ...                                                ...
64    VA        Richmond    \n \n \n   \n \n \n  \n      Contact Informa...
65    VA  Virginia Beach    \n1.  Project Vision  .........................
66    WA         Seattle  Beyond Traffic: USDOT Smart City Challenge\nAp...
67    WA         Spokane  USDOT Smart City Challenge -  Spokane  \nPage ...
68    WI         Madison  Building a Smart Madison  \nfor Shared Prosper...

[69 rows x 3 columns]

## Cleaning Up PDFs (Required)


One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, common words (smart, city, page, etc.). Keep in mind n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data to remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
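
One possible way to spot applicants whose text was not extracted correctly is a simple heuristic on length and character content. The sketch below is only an illustration: the function name `looks_badly_extracted` and both thresholds (`min_words`, `min_alpha_ratio`) are assumptions that would need tuning against the actual documents.

```python
def looks_badly_extracted(text, min_words=200, min_alpha_ratio=0.6):
    """Heuristically flag text that is too short or mostly non-alphabetic."""
    words = text.split()
    # Scanned PDFs often yield little or no text at all.
    if len(words) < min_words:
        return True
    non_space = [ch for ch in text if not ch.isspace()]
    if not non_space:
        return True
    # A low share of letters suggests symbol or digit debris rather than prose.
    alpha_ratio = sum(ch.isalpha() for ch in non_space) / len(non_space)
    return alpha_ratio < min_alpha_ratio


print(looks_badly_extracted("transit data " * 300))
print(looks_badly_extracted("#### 1234 " * 300))
```

Flagged documents would still deserve a manual look before removal, since the assignment caps removals at 15 cities.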


In [301]:
import nltk
import spacy
import en_core_web_lg
import unicodedata
from contractions import CONTRACTION_MAP
import re
from nltk.corpus import wordnet
import collections
from nltk.tokenize.toktok import ToktokTokenizer
from bs4 import BeautifulSoup

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
nlp = en_core_web_lg.load()
# nlp_vec = spacy.load('en_vectors_web_lg', parse=True, tag=True, entity=True)

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    if bool(soup.find()):
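        # Drop embedded frames and scripts before extracting the visible text
        for s in soup(['iframe', 'script']):
            s.extract()
        stripped_text = soup.get_text()
        # Collapse runs of carriage returns and newlines into a single newline
        stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)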
    else:
        stripped_text = text
    return stripped_text


#def correct_spellings_textblob(tokens):
#	return [Word(token).correct() for token in tokens]  


def simple_porter_stemming(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text


def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text


def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word
            
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens


def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text


def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]|\[|\]' if not remove_digits else r'[^a-zA-Z\s]|\[|\]'
    text = re.sub(pattern, '', text)
    return text


def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_stemming=False, text_lemmatization=True, 
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:

        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)

        # remove extra newlines
        doc = doc.translate(doc.maketrans("\n\t\r", "   "))

        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)

        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)

        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)

        # stem text
        if text_stemming and not text_lemmatization:
            doc = simple_porter_stemming(doc)

        # remove special characters and\or digits
        #insert spaces between special characters to isolate them        
        if special_char_removal:
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)

        # lowercase the text
        if text_lower_case:
            doc = doc.lower()

        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case, stopwords=stopwords)

        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        doc = doc.strip()
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

In [302]:
from collections import Counter

def get_most_common_words(text, n=10):
    word_frequencies = Counter(text.split())
    most_common_words = word_frequencies.most_common(n)
    return [word for word, count in most_common_words]

def remove_most_common_words(text, most_common_words):
    tokens = text.split()
    filtered_tokens = [token for token in tokens if token not in most_common_words]
    return ' '.join(filtered_tokens)


In [303]:
custom_stopwords = ["smart", "city", "page", "content", "appendix", ""]
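# Flatten city names into single lowercase tokens; nested lists would never match a token in remove_stopwords()
city_names = [word for name in df['City'] for word in name.lower().split()]
state_abbv = [abbv.lower() for abbv in df['State']]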
state_names = [
    'alabama', 'alaska', 'arizona', 'arkansas', 'california', 'colorado', 'connecticut', 'delaware', 'florida',
    'georgia', 'hawaii', 'idaho', 'illinois', 'indiana', 'iowa', 'kansas', 'kentucky', 'louisiana', 'maine',
    'maryland', 'massachusetts', 'michigan', 'minnesota', 'mississippi', 'missouri', 'montana', 'nebraska',
    'nevada', 'new hampshire', 'new jersey', 'new mexico', 'new york', 'north carolina', 'north dakota', 'ohio',
    'oklahoma', 'oregon', 'pennsylvania', 'rhode island', 'south carolina', 'south dakota', 'tennessee', 'texas',
    'utah', 'vermont', 'virginia', 'washington', 'west virginia', 'wisconsin', 'wyoming'
]
custom_stopwords.extend(city_names + state_names + state_abbv)
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(custom_stopwords)
df['Initial Cleaned Text'] = normalize_corpus(df['Raw Text'], stopwords=stopwords)
df['Most Common Words'] = df['Initial Cleaned Text'].apply(get_most_common_words) 


#### Add the cleaned text to the structure you created.


In [304]:
df['Final Cleaned Text'] = df.apply(lambda row: remove_most_common_words(row['Initial Cleaned Text'], row['Most Common Words']), axis=1)

### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

I removed `GA Columbus.docx` and `NM Albuquerque.docx` because my pipeline does not parse .docx documents. 

#### Explain what additional text processing methods you used and why.

After reviewing some files, I came up with a general list of stopwords: `["smart", "city", "page", "content", "appendix", ""]`. This list is extended with the state and city names derived from each file name. In addition, I defined two functions, `get_most_common_words()` and `remove_most_common_words()`, to further clean the text. After `normalize_corpus()` is applied, a "Most Common Words" column is created that holds each document's ten most frequent words. These words are then removed from the normalized text, and the result is stored in the dataframe as the "Final Cleaned Text" column.

#### Did you identify any potentially problematic words?

Yes. The problematic words are smart, city, page, content, appendix, and the unicode character ''. They occur frequently but contribute little value to the upcoming analysis. This list is constant no matter the file. State and city names are already stored in their own columns, so they can also be removed from the text since their occurrence contributes little to no value.
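
A corpus-wide view can complement the per-document counts: words that appear in nearly every application carry little signal for clustering. The sketch below is an illustration only. The function name `high_document_frequency_words`, the `min_share` threshold, and the toy documents are assumptions, not part of the notebook.

```python
from collections import Counter


def high_document_frequency_words(cleaned_docs, min_share=0.9):
    """Return words that occur in at least `min_share` of the documents, sorted alphabetically."""
    doc_freq = Counter()
    for doc in cleaned_docs:
        # set() so each word counts once per document
        doc_freq.update(set(doc.split()))
    threshold = min_share * len(cleaned_docs)
    return sorted(word for word, count in doc_freq.items() if count >= threshold)


# Hypothetical cleaned documents standing in for df['Initial Cleaned Text']
docs = [
    "transportation data transit vehicle",
    "transportation parking data downtown",
    "transportation sensor data network",
]
print(high_document_frequency_words(docs))  # ['data', 'transportation']
```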

## Experimenting with Clustering Models (Required)

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.


| Algorithm    | K = 9                     | K = 18                    | K = 36                   | Optimal k                 |
| ------------ | ------------------------- | ------------------------- | ------------------------ | ------------------------- |
| K-means      | [-0.0030, 1.2582, 1.8197] | [-0.0145, 1.2718, 1.4402] | [0.0083, 1.2775, 1.1046] | [0.0083, 1.2775, 1.1046]  |
| Hierarchical | [-0.0109, 1.9876, 2.4133] | [0.0107, 1.7109, 1.8672]  | [0.0450, 1.6431, 1.1982] | [0.0450, 1.6431, 1.1982]  |
| DBSCAN       |                           |                           |                          | [-0.0478, 2.5597, 1.3724] |


In [305]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

In [306]:
def compute_metrics(name, model, X):
    X_dense = X.toarray()
    labels = model.fit_predict(X_dense)
    if len(np.unique(labels)) > 1:
        silhouette = silhouette_score(X_dense, labels)
        calinski_harabasz = calinski_harabasz_score(X_dense, labels)
        davies_bouldin = davies_bouldin_score(X_dense, labels)
    else:
        silhouette = calinski_harabasz = davies_bouldin = np.nan

    print(f'{name}: S, CH, DB')
    print(f'[{silhouette:.4f}, {calinski_harabasz:.4f}, {davies_bouldin:.4f}]')

In [307]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Final Cleaned Text'])
k_values = [9, 18, 36]
models = {
    'KMeans': [KMeans(n_clusters=k) for k in k_values],
    'Hierarchical': [AgglomerativeClustering(n_clusters=k) for k in k_values],
    'DBSCAN': [DBSCAN()]
}

for name, model_list in models.items():
    for i, model in enumerate(model_list):
        if name != 'DBSCAN':
            print(f"{name} (k={k_values[i]})")
        else:
            print(f"{name}")
        compute_metrics(name, model, X)

KMeans (k=9)
KMeans: S, CH, DB
[-0.0030, 1.2582, 1.8197]
KMeans (k=18)
KMeans: S, CH, DB
[-0.0145, 1.2718, 1.4402]
KMeans (k=36)
KMeans: S, CH, DB
[0.0083, 1.2775, 1.1046]
Hierarchical (k=9)
Hierarchical: S, CH, DB
[-0.0109, 1.9876, 2.4133]
Hierarchical (k=18)
Hierarchical: S, CH, DB
[0.0107, 1.7109, 1.8672]
Hierarchical (k=36)
Hierarchical: S, CH, DB
[0.0450, 1.6431, 1.1982]
DBSCAN
DBSCAN: S, CH, DB
[-0.0478, 2.5597, 1.3724]


In [308]:
X_dense = X.toarray()
for name, model_list in models.items():
    # DBSCAN does not take a cluster count, so there is no k to select for it
    if name == 'DBSCAN':
        continue
    silhouette_scores = []
    for model in model_list:
        labels = model.fit_predict(X_dense)
        silhouette_scores.append(silhouette_score(X_dense, labels))

    # Pick the tested k with the highest Silhouette score
    optimal_k = k_values[np.argmax(silhouette_scores)]
    print(f"{name}: Optimal K = {optimal_k}")

KMeans: Optimal K = 36
Hierarchical: Optimal K = 36


#### How did you approach finding the optimal k?

My approach uses the Silhouette score as the benchmark. Among the tested values (k = 9, 18, and 36), the number of clusters that produces the highest Silhouette score is taken as the most appropriate one.
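
The assignment asks for the optimal k over the range 2 to 50. A minimal sketch of such a sweep, using the Silhouette score as the criterion, is shown here. It runs on synthetic data from `make_blobs` as a stand-in for the TF-IDF matrix. The function name `best_k_by_silhouette`, the fixed `random_state`, and `n_init=10` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X_dense, k_min=2, k_max=50, random_state=0):
    """Return (best_k, scores), where scores maps each k to its Silhouette score."""
    # Silhouette needs at most n_samples - 1 clusters, so cap the upper bound.
    k_max = min(k_max, X_dense.shape[0] - 1)
    scores = {}
    for k in range(k_min, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_dense)
        scores[k] = silhouette_score(X_dense, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical stand-in data: 60 points drawn around 4 centers.
X_demo, _ = make_blobs(n_samples=60, centers=4, cluster_std=0.5, random_state=0)
best_k, scores = best_k_by_silhouette(X_demo, k_max=10)
print(best_k, round(scores[best_k], 4))
```

On the notebook's data the call would be along the lines of `best_k_by_silhouette(X.toarray())`. The same loop could be repeated with `AgglomerativeClustering` for the Hierarchical row.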

#### What algorithm do you believe is the best? Why?

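Based on the table above, I believe the Hierarchical model with 36 clusters is the best overall. It produced the highest Silhouette score (0.0450), its Calinski-Harabasz score (1.6431, higher is better) exceeds every K-means configuration, and its Davies-Bouldin score (1.1982, lower is better) is the second lowest in the table, behind only K-means with k = 36 (1.1046).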

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city respectively. Show the code to append the clusterid below.

In [309]:
hierarchical_model = AgglomerativeClustering(n_clusters=36)
hierarchical_labels = hierarchical_model.fit_predict(X.toarray())
df['Cluster ID'] = hierarchical_labels

### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to persist the model using one of the methods listed in the link.

In [316]:
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(hierarchical_model, f, protocol=0)
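
A quick round trip can confirm that the pickled model restores correctly. The sketch below uses toy data and in-memory bytes instead of `model.pkl`, and the variable names are hypothetical. It also shows that `AgglomerativeClustering` keeps its training labels in `labels_` but has no `predict` method, so a script that loads the model would likely need to refit it or assign new documents some other way.

```python
import pickle

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two obvious groups of 2-D points.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
model = AgglomerativeClustering(n_clusters=2).fit(X_toy)

# Serialize to bytes and back; pickle.dump / pickle.load on a file behave the same way.
restored = pickle.loads(pickle.dumps(model))

print(restored.labels_)
print(hasattr(restored, "predict"))
```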

## Deriving Themes and Concepts (Required)

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

In [311]:
from scipy.sparse.linalg import svds

In [312]:
def low_rank_svd(matrix, singular_count=2):
    u, s, vt = svds(matrix, k=singular_count)
    return u, s, vt

def get_top_words_for_topic(vt, feature_names, num_top_words=5):
    top_words_indices = (-vt).argsort()[:, :num_top_words]
    top_words = [[feature_names[index] for index in topic] for topic in top_words_indices]
    return top_words

def remove_city_state_names(top_words, city_names, state_names):
    city_state_names = [name.lower() for name in city_names + state_names]
    filtered_top_words = [word for word in top_words if word not in city_state_names]
    return filtered_top_words

def correct_words(words):
    corrected_words = []
    for word in words:
        doc = nlp(word)
        if len(doc) > 0 and doc[0].has_vector:
            token = doc[0].text
            corrected_words.append(token)
    return corrected_words

In [313]:
num_topics = 36
u, s, vt = low_rank_svd(X.toarray(), singular_count=num_topics)

feature_names = vectorizer.get_feature_names_out()
top_words_per_topic = get_top_words_for_topic(vt, feature_names, num_top_words=5)

city_names = df['City'].unique().tolist()
state_names = df['State'].unique().tolist()
filtered_and_corrected_top_words = []

for i, top_words in enumerate(top_words_per_topic):
    filtered_words = remove_city_state_names(top_words, city_names, state_names)
    corrected_words = correct_words(filtered_words)
    filtered_and_corrected_top_words.append(corrected_words)

for i, top_words in enumerate(filtered_and_corrected_top_words):
    print(f"Topic {i+1}: {', '.join(top_words)}")

Topic 1: service, mobility, indy
Topic 2: anchorages
Topic 3: lmg, tarc
Topic 4: new
Topic 5: linc, time, dade
Topic 6: jax, dart, applicant, jea
Topic 7: pinellas, time, pg
Topic 8: madisons, uw, time, prosperity, feb
Topic 9: dade, linc, okc
Topic 10: lmg, mdot, linc, tarc
Topic 11: dade, jax, lmg, vision, service
Topic 12: okc, dade, would
Topic 13: uw, madisons, traffic, information
Topic 14: applicant, nys, dade, vdot
Topic 15: marta, madisons, uw
Topic 16: area, niagara
Topic 17: pinellas, nys, schenectady
Topic 18: traffic, icar, vision, pinellas
Topic 19: pinellas, parking, davidson, downtown
Topic 20: provide, traffic, figure
Topic 21: technology, vehicle, pinellas, service
Topic 22: public, project, system, dart
Topic 23: public, pinellas, dade, county
Topic 24: transit, public, traffic, metro, firestone
Topic 25: project, challenge, traffic, nys
Topic 26: traffic, transit, datum, niagara, project
Topic 27: vehicle, hampton, dade, vdot
Topic 28: vehicle, marta, provide, demon

### Extract themes
Write a theme for each topic (at least a sentence each).

| Unique Topics | Keywords                                           | Theme                                                                                                 |
| ------------- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| Topic 1       | service, mobility, indy                            | Innovative services that help mobility are important.                                                 |
| Topic 2       | madisons, uw, vision, pg                           | Universal collaboration can be an essential way to improve city planning.                             |
| Topic 3       | linc, davidson, clair, niagara                     | Collaboration across regions is key to achieving the smart city goal.                                 |
| Topic 4       | Dade, Cleveland's                                  | Cleveland is a great example of accessibility on the road for people who walk slowly.                 |
| Topic 5       | dart, city's, reimagine                            | A key part of achieving smart city is to reimagine public transportation.                             |
| Topic 6       | Jax, dart, applicant, jea                          | Energy plays an essential role in public transportation.                                              |
| Topic 7       | pinellas, time, pg                                 | Transportation improves travel time.                                                                  |
| Topic 8       | madisons, uw, time, prosperity, feb                | University of Wisconsin contributed to the prosperity of smart city in February (not sure which year) |
| Topic 9       | Jax, lmg, mdot                                     | Transportation planning in various cities plays an important role in reaching the smart city goal.    |
| Topic 10      | time, davidson, public                             | Public transportation in Davidson county saves time.                                                  |
| Topic 11      | Dade, Jax, lmg, vision, service                    | Accessibility is a long-term vision in Jacksonville.                                                  |
| Topic 12      | dart, applicant, datum, bostons, clair             | Data-driven solutions benefited Boston's public transportation.                                       |
| Topic 13      | uw, madisons, traffic, information                 | University of Wisconsin contributed great information on traffic control.                             |
| Topic 14      | applicant, nys, Dade, vdot                         | New York's public transportation benefited from several factors.                                      |
| Topic 15      | marta, madisons, uw                                | University of Wisconsin contributed to public transportation's growth.                                |
| Topic 16      | okc, pima, pag                                     | Oklahoma City's Pima Association of Governments plays an important role in achieving smart city.      |
| Topic 17      | Cleveland's, linc, okc, madisons                   | University of Wisconsin assisted Cleveland and Oklahoma City to achieve better public transportation. |
| Topic 18      | traffic, icar, vision, pinellas                    | Traffic management benefits from innovation.                                                          |
| Topic 19      | pinellas, parking, davidson, downtown              | Downtown parking stress could benefit from smart city.                                                |
| Topic 20      | provide, traffic, figure                           | Efficient transportation solutions can be provided by smart city planning.                            |
| Topic 21      | technology, vehicle, pinellas, service             | Smart vehicles and services could also help achieve the smart city goal.                              |
| Topic 22      | bart, pinellas, service, traffic                   | Public transit and traffic management could benefit from smart city.                                  |
| Topic 23      | public, pinellas, Dade, county                     | Public transportation in Pinellas County hopes to make further improvements in accessibility.         |
| Topic 24      | data, service, new, transportation                 | Data-driven solution can benefit transportation services.                                             |
| Topic 25      | transit, datum, marta, feb                         | Transit data is very valuable.                                                                        |
| Topic 26      | traffic, transit, datum, niagara, project          | Traffic and transit project management in Niagara county is important.                                |
| Topic 27      | firestoneu use, deck, bowery, kenmore              | Several cities' development and infrastructure play an important role in smart city visions.          |
| Topic 28      | use, system, bart, transit, public                 | Public transportation system's usage is essential.                                                    |
| Topic 29      | use, service, transit                              | Usage of transit services can be beneficial to smart city.                                            |
| Topic 30      | technology, transportation, datum, service         | Transportation's tech advancement is data driven.                                                     |
| Topic 31      | use, vehicle, traffic, uos, rhode                  | Self-driven vehicle's usage impacted traffic in Rhode Island.                                         |
| Topic 32      | transportation, use, technology, transit           | Innovative transportation solutions improve transit's usage.                                          |
| Topic 33      | transit, transportation, onal, innova, uab         | Innovative transportation and collaboration play an important role in achieving smart city goal.      |
| Topic 34      | system, bart, time, transportation, bay            | Public transportation in bay area relies on a timely system.                                          |
| Topic 35      | network, transit, information, provide, management | Transit network and information provide ways for stakeholders to manage the load.                     |
| Topic 36      | provide, include, public, use, service             | Public transportation promotes inclusive use of services                                              |

### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.

In [314]:
df['Cluster ID'] = hierarchical_labels

cluster_topic_scores = np.zeros((len(np.unique(hierarchical_labels)), num_topics))

for cluster_id in range(len(np.unique(hierarchical_labels))):
    cluster_cities_indices = np.where(hierarchical_labels == cluster_id)
    cluster_topic_scores[cluster_id] = u[cluster_cities_indices].mean(axis=0)

top_two_topics_df = pd.DataFrame(columns=['Top Topic 1', 'Top Topic 2'])

for city_index in range(df.shape[0]):
    cluster_id = df.loc[city_index, 'Cluster ID']
    cluster_topics = cluster_topic_scores[cluster_id]
    sorted_topics_indices = np.argsort(cluster_topics)[::-1][:2]
    top_two_topics = [filtered_and_corrected_top_words[topic] for topic in sorted_topics_indices]
    top_two_topics_df.loc[city_index] = [', '.join(top_two_topics[0]), ', '.join(top_two_topics[1])]

df['Top Topic 1'] = top_two_topics_df['Top Topic 1']
df['Top Topic 2'] = top_two_topics_df['Top Topic 2']
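
The cell above assigns topics through each city's cluster average. For comparison, a per-document variant can be sketched as follows. It ranks topics by the absolute size of each document's weight, since the sign of an SVD component is arbitrary. The function name `top_two_topics` and the toy weight matrix are assumptions, and the `T1, T2` labels follow the sample output table at the top of the notebook.

```python
import numpy as np


def top_two_topics(doc_topic_weights):
    """Return an (n_docs, 2) array with the indices of the two strongest topics per row."""
    # Use absolute weights because the sign of an SVD component is arbitrary.
    strength = np.abs(doc_topic_weights)
    # argsort is ascending, so take the last two columns and reverse them.
    return np.argsort(strength, axis=1)[:, -2:][:, ::-1]


# Hypothetical document-topic weights for 3 documents and 4 topics.
weights = np.array([
    [0.1, -0.7, 0.3, 0.0],
    [0.5, 0.2, -0.1, 0.9],
    [-0.2, 0.1, 0.6, 0.4],
])
top = top_two_topics(weights)
# Format as 1-based labels in the "T1, T2" style.
topic_ids = [", ".join(f"T{i + 1}" for i in row) for row in top]
print(topic_ids)  # ['T2, T3', 'T4, T1', 'T3, T4']
```

With the notebook's variables, the input would be the `u` matrix returned by `low_rank_svd`, optionally scaled by `s`.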

## Write output data (Required)

The output data should be written as a TSV file.
You can use `to_csv` method from Pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep = '\t')` \
`df.to_csv('smartcity_eda.tsv', sep='\t')`
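
Raw PDF text usually contains tabs and newlines, so it can help to check that a TSV survives a round trip. The sketch below writes a hypothetical one-row frame, modeled on the sample output table, to an in-memory buffer and reads it back. It relies on the default quoting of `to_csv`, which wraps fields containing the separator or a line break in quotes.

```python
import io

import pandas as pd

# Hypothetical one-row frame mirroring the sample output table's columns.
out = pd.DataFrame({
    "city": ["Norman, OK"],
    "raw text": ["Test,\ttest ,\nand testing."],
    "clean text": ["test test test"],
    "clusterid": [0],
    "topicids": ["T1, T2"],
})

buffer = io.StringIO()
out.to_csv(buffer, sep="\t", index=False)

# Read the TSV back to confirm the tab and newline in "raw text" survived.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored.loc[0, "raw text"] == out.loc[0, "raw text"])
```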

In [315]:
df.to_csv('smartcity_eda.tsv', sep='\t', index=False, escapechar='ᚦ')

# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
