---
title: Data Gathering
format:
  html:
    css: h1
    embed-resources: true
    code-fold: true
---

All data files, whether generated with code or downloaded directly from the source, can be found and downloaded in the data tab of the website.

## APIs

### Reddit API

The following code searched Reddit for posts about public transportation and saved the URLs of the matching threads to a .json file.


```{r}
library(RedditExtractoR)
library(jsonlite)

top_Pub_Transp_urls <- find_thread_urls(keywords="public transportation")
jsonlite::write_json(top_Pub_Transp_urls, "top_pub_transp_urls.json")
```

Next, the content of those Reddit posts was extracted and scored with sentiment analysis. The following code loads the resulting scores from sentiment_scores.json, builds a data frame, and saves it to a .csv file.


```{python}
import json

import pandas as pd

# Load the sentiment scores from the JSON file
with open('sentiment_scores.json', 'r') as json_file:
    sentiment_scores = json.load(json_file)

# Initialize lists to store data
ids = []
neg_scores = []
neu_scores = []
pos_scores = []
compound_scores = []

# Extract the scores and create separate lists for each
for idx, item in enumerate(sentiment_scores, start=1):
    ids.append(idx)
    sentiment_score = item.get('sentiment_score', {})
    neg_scores.append(sentiment_score.get('neg', 0))
    neu_scores.append(sentiment_score.get('neu', 0))
    pos_scores.append(sentiment_score.get('pos', 0))
    compound_scores.append(sentiment_score.get('compound', 0))

# Create a DataFrame
data = {
    'ID': ids,
    'Negative Score': neg_scores,
    'Neutral Score': neu_scores,
    'Positive Score': pos_scores,
    'Compound Score': compound_scores
}

df = pd.DataFrame(data)

# Save to CSV
df.to_csv('sentiment_scores.csv', index=False)
```

The first few rows of the final data look like this:

![](images/Sentiment_analysis.jpeg){width=25% fig-align="center"}

The text of the posts was also saved to a separate data file, using separate code, so that it can be analyzed later.
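The conversion step above depends on the layout of sentiment_scores.json, which is not shown on this page. The sketch below assumes each record is an object with a nested `sentiment_score` dictionary holding `neg`, `neu`, `pos`, and `compound` keys, which is what the `.get()` calls in the code imply. The sample records and the `flatten_scores` helper are hypothetical and only illustrate the flattening with the standard library.

```python
import csv
import io
import json

# Hypothetical sample mirroring the assumed layout of sentiment_scores.json.
sample_json = """
[
  {"sentiment_score": {"neg": 0.1, "neu": 0.7, "pos": 0.2, "compound": 0.34}},
  {"sentiment_score": {"neg": 0.0, "neu": 1.0, "pos": 0.0}},
  {}
]
"""

# Maps each score key to the column name used in the final CSV.
COLUMNS = {
    "neg": "Negative Score",
    "neu": "Neutral Score",
    "pos": "Positive Score",
    "compound": "Compound Score",
}


def flatten_scores(records):
    """Turn nested score records into flat rows, defaulting missing values to 0."""
    rows = []
    for idx, item in enumerate(records, start=1):
        scores = item.get("sentiment_score", {})
        row = {"ID": idx}
        for key, column in COLUMNS.items():
            row[column] = scores.get(key, 0)
        rows.append(row)
    return rows


rows = flatten_scores(json.loads(sample_json))

# Write to an in-memory buffer here; a real run would open a .csv file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", *COLUMNS.values()])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

One consequence of the default of 0 is that a post with no stored score cannot be told apart from a post whose score is genuinely 0, so it may be worth checking for missing records before analysis.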

### API for cityofchicago.org

The following code retrieved a data set on bus service from the City of Chicago data portal and saved it to a .csv file.

```{python}
import pandas as pd
from sodapy import Socrata

# Unauthenticated client (no app token); this works for public data sets
client = Socrata("data.cityofchicago.org", None)

# Request up to 2000 rows of the data set with ID "bynn-gwxy"
results = client.get("bynn-gwxy", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

# Save to CSV
results_df.to_csv('Chicago_avg_Buses.csv')
```

This is a snapshot of the data:

![](images/Chicago_avg_bus.jpeg){width=50% fig-align="center"}
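The request above uses `limit=2000`, so it returns at most 2000 rows. If the data set ever holds more rows than that, one option is to request it in pages using the `limit` and `offset` parameters that `client.get` accepts. The following is a minimal sketch of that idea, not the code used for this project: `fetch_all` and `get_page` are hypothetical names, and a stand-in data source replaces the live API so the example runs offline.

```python
def fetch_all(get_page, page_size=1000):
    """Collect every record by requesting pages until a short page comes back.

    `get_page(limit, offset)` is a hypothetical callable returning a list of records.
    """
    records = []
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset)
        records.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size
    return records


# Stand-in data source so the sketch runs without network access.
fake_rows = [{"row": i} for i in range(2500)]


def fake_get_page(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_get_page, page_size=1000)
print(len(all_rows))

# With sodapy, the callable could wrap the client from the code above, e.g.:
# get_page = lambda limit, offset: client.get("bynn-gwxy", limit=limit, offset=offset)
```

Passing the page fetcher in as a function keeps the loop independent of the API client, which makes it easy to check the stopping condition without contacting the server.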

## Data from bts.gov

The files bus_consumption_fuel_by_year.xlsx, energy_consumed_byMill_passenger_MILES.xlsx, fatalities_bus_over_time.xlsx, National_transport_usage_linked_Economic_trend.xlsx, and vehicle_production_countries.xlsx were downloaded from https://www.bts.gov.

These are some snapshots of what the data looks like:

![](images/Bus_consumption.jpeg){width=25% fig-align="center"}

![](images/energy.jpeg){width=25% fig-align="center"}

![](images/facilities_bus.jpeg){width=25% fig-align="center"}

![](images/National_transp_usage.jpeg){width=25% fig-align="center"}

![](images/vehicle_prod.jpeg){width=25% fig-align="center"}

## Data from International Transport Forum

The file Value_transport_by_countries.csv was downloaded from: https://stats.oecd.org/Index.aspx?DataSetCode=ITF_PASSENGER_TRANSPORT 

The data in that file looks like this:

![](images/value_transp_countries.jpeg){width=25% fig-align="center"}

## Data from data.world

The file DC_Metro_Scorecard.xlsx and the zip folders Walkable_distance_to_PubTrans and capmetro_smart_trips_questionaire contain data that was downloaded from data.world. The data came from the following three links (in order of mention):

- https://data.world/makeovermonday/2016w51
- https://data.world/chhs/5e391154-f07d-4e0c-ab1b-687a0c4c5d06
- https://data.world/browning/capmetro-smart-trips-questionaire

The data looks like this (in order of mention):

![](images/DC_scorecard.jpeg){width=25% fig-align="center"}

![](images/Walkable_dist.jpeg){width=25% fig-align="center"}

![](images/survey.jpeg){width=25% fig-align="center"}