# Arlington 2050 Wrap Up

## Summary

- I participated in Arlington 2050, a project created by Arlington County, Virginia, to gather residents' views on Arlington's future. The county set up polling stands where residents were given a postcard and asked to respond to two prompts.
- The prompts were "Share your message from the future here!" and "Getting here wasn't easy, but it was worth it! Here is how we did it:"
- My part in the project was analyzing the data and creating visuals from it.
- I specifically worked on the data collected from the county fair.

Let's begin writing the code. First, let's import pandas.

In [2]:
import pandas as pd

Now let's load the Excel file with all our data into pandas and view the columns.

In [None]:
array1 = pd.read_excel("../../CountyFair.xlsx")
array1.info()

As we can see, there are some unnamed columns with data, so let's rename them to match the data they store.

In [None]:
# Give the unnamed columns descriptive names.
ds = array1.rename(columns={
    "Unnamed: 1": "Year2050",
    "Unnamed: 2": "Translation1",
    "Unnamed: 3": "Getting_Here",
    "Unnamed: 4": "Translation2",
})
ds.info()

Now that the columns all have names, we can begin our analysis. Let's import some libraries that can help us analyze the data.

In [5]:
import spacy
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from spacytextblob.spacytextblob import SpacyTextBlob

When looking through the data, I found that some of the responses were in Spanish. To avoid errors, let's replace them with their translated versions.

In [6]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

string_list = ds['Year2050'].tolist()
translation_list = ds['Translation1'].tolist()
# Where a translation exists (the cell is not NaN), overwrite the original
# response at the same index with the translated text.
for index, translation in enumerate(translation_list):
    if pd.notna(translation):
        string_list[index] = str(translation)
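The replacement step can also be tried out on a small made-up example. This is only a sketch: the sample responses are hypothetical and not taken from the county fair data, and `is_missing` and `merge_translations` are illustrative helper names rather than part of the notebook.

```python
import math

def is_missing(value):
    """Return True for None or NaN, the values pandas uses for empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def merge_translations(originals, translations):
    """Prefer the translation when one exists, otherwise keep the original."""
    return [
        original if is_missing(translated) else str(translated)
        for original, translated in zip(originals, translations)
    ]

# Hypothetical sample data (not from the real spreadsheet).
originals = ["More parks for everyone", "Más viviendas asequibles", "Better schools"]
translations = [float("nan"), "More affordable housing", float("nan")]

merged = merge_translations(originals, translations)
print(merged)
```

Building a new list instead of overwriting in place keeps the original responses available for comparison.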

Now we can separate all of the words used in our data, which will allow us to find the most common words.

In [7]:
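# Join the responses (with translations applied) into one string. Use a space
# as the separator so words from neighboring responses do not run together.
text = " ".join(str(s) for s in string_list if pd.notna(s))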

doc = nlp(text)

words = [token.text for token in doc if not token.is_stop and not token.is_punct]
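One way to turn a token list into actual counts is `collections.Counter` from the standard library. The sketch below assumes a short hypothetical list, `sample_words`, standing in for the `words` list; the values are made up for illustration.

```python
from collections import Counter

# Hypothetical token list standing in for `words` (not real survey data).
sample_words = ["housing", "Parks", "housing", "schools", "parks", "community", "housing"]

# Lowercase each token so that "Parks" and "parks" count as the same word.
counts = Counter(word.lower() for word in sample_words)

# The two most frequent words with their counts.
top_words = counts.most_common(2)
print(top_words)  # [('housing', 3), ('parks', 2)]
```

A word cloud scales words by frequency in a similar way, so a table like this can serve as a numeric check on what the picture shows.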

Once we have our list of words, we can create a word cloud to visualize which words were the most common.

In [None]:
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white',
    max_words=100,
    mask=None,
    contour_width=3,
    contour_color='steelblue',
).generate(" ".join(words))

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

From this word cloud we can see the main topics that residents brought up: "housing", "affordability", "parks", "school", and "community". These five words suggest the values and issues that matter most to the residents who responded.

Another way to visualize the data provided is by using histograms.

To start, let's visualize the polarities and subjectivities of the responses.

In [9]:
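pol_list = []
sub_list = []
# Iterate from index 2 onward, so the first two rows are skipped.
for response in string_list[2:]:
    doc = nlp(response)
    # spacytextblob exposes the sentiment scores through doc._.blob.
    pol_list.append(doc._.blob.polarity)
    sub_list.append(doc._.blob.subjectivity)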

In this code block, I computed the polarity and subjectivity of each individual response and stored them in two lists. Now that we have the data, we can import the libraries needed to create the histograms.

In [10]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Now we can create a histogram out of our data. Let's start with the subjectivity of the responses.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=sub_list)
plt.title('Subjectivity of Postcard Responses from County Fair')
plt.xlabel('Subjectivity (0 to 1)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

Now we can do the same for the polarities of the responses.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=pol_list)
plt.title('Polarity of Postcard Responses from County Fair')
plt.xlabel('Polarity (-1 to 1)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
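As a numeric complement to the histograms, a few summary statistics can be computed with the standard library. This is a minimal sketch: `sample_polarities` is a hypothetical list standing in for `pol_list`, and its values are made up rather than taken from the responses.

```python
from statistics import mean, median

# Hypothetical polarity scores standing in for `pol_list` (range -1 to 1).
sample_polarities = [0.5, 0.0, -0.25, 0.75, 0.0]

summary = {
    "mean": mean(sample_polarities),
    "median": median(sample_polarities),
    # Fraction of responses with a polarity above zero.
    "share_positive": sum(p > 0 for p in sample_polarities) / len(sample_polarities),
}
print(summary)
```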

## Dimensionality Reduction

- Dimensionality reduction is a method for representing a given dataset using a lower number of features (i.e. dimensions) while still capturing the original data's meaningful properties.
- The reason we do this is to remove irrelevant or redundant features, or simply noisy data, to create a model with a lower number of variables.
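To make the idea concrete, here is a minimal sketch of one common technique, principal component analysis (PCA), reducing two features to one. It is not part of the county fair analysis: the data points are hypothetical, `pca_2d_to_1d` is an illustrative name, and the closed-form angle used here only works for two features. In practice a library such as scikit-learn's `PCA` would typically be used.

```python
import math

def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix
    # [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum(x * x for x, _ in centered) / (n - 1)
    var_y = sum(y * y for _, y in centered) / (n - 1)
    cov_xy = sum(x * y for x, y in centered) / (n - 1)

    # For a symmetric 2x2 matrix, the direction of largest variance
    # makes this angle with the x-axis.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(angle), math.sin(angle))

    # Each point is reduced to a single coordinate along that direction.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

# Hypothetical data: two strongly correlated features (y is roughly 2 * x).
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, projected = pca_2d_to_1d(points)
print(direction)
print(projected)
```

Because the two features are nearly redundant, the single projected coordinate keeps almost all of the variation in the original points.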

## Summary

In summary, this project was used to gather and analyze responses from residents of Arlington, Virginia. Through this project, I learned different ways to visualize data and how to use pandas to clean and adapt data to my needs. This project was fun because it was a real-world use of the coding we are learning in school.