# Phase II: Data Curation, Exploratory Analysis and Plotting (5\%)

### Team Members:
- Colin Hui
- Derek Aslan
- Aydan Ali
- Conor Cummings


## Part 1: 
(1%) Expresses the central motivation of the project and explains the (at least) two key questions to be explored. Gives a summary of the data processing pipeline so a technical expert can easily follow along.

## Problem Motivation 

Global economic development remains uneven: countries in similar geopolitical circumstances often experience very different levels of population wellbeing. Understanding the relationships between economic indicators, political factors, and policy decisions is important for identifying pathways to improved quality of life. This analysis focuses on how these indicators and factors influence population wellbeing across world regions and countries, providing insights that could inform evidence-based policy decisions for sustainable development.
The key questions are as follows:

1. Are there peaks in bike usage during certain times of the day?
2. Which stations have the highest traffic, and how does the distribution of bike usage vary geographically? Can we identify high-demand areas that would benefit from more bikes?
3. How does bike trip duration vary by user type (e.g., age, gender, membership status)?
4. Can we predict the trip duration based on factors such as the time of day, user demographics, and starting/ending stations?

Motivating sources:
- Northeastern University. "Améera." Northeastern University Landscapes, Northeastern University, https://landscapes.northeastern.edu/ameera/. Accessed 30 Sept. 2024.
- "Environmental Awareness Month: Go Green with Blue Cross and Bluebikes in September." Blue Bikes Boston, http://blog.bluebikes.com/blog/environmental-awareness-month-go-green-with-blue-cross-and-bluebikes-in-september. Accessed 30 Sept. 2024.
- Streets Cabinet. "Streets Cabinet Announces Bluebikes Expansion Planning." Boston.gov, 12 Aug. 2024, https://www.boston.gov/news/streets-cabinet-announces-bluebikes-expansion-planning. Accessed 30 Sept. 2024.

## Summary of the Data Processing Pipeline

1. Web scrape to get the raw data
2. Clean the data to prepare the data frame for visualization and analysis
3. Visualize using plotting libraries, such as Seaborn, Plotly, and Matplotlib

To process the data, we will first acquire the Bluebikes datasets by web scraping, clean them, save the cleaned datasets as .csv files, and import them into our Jupyter Notebook. Cleaning consists of removing invalid values (NaN, n/a, and 0 when applicable), handling missing or inconsistent values in columns such as start station ID, end station ID, and bikeId, converting the starttime and stoptime columns to DateTime format, and removing unanalyzable records such as trips with a negative tripduration.

Next, we will derive the features needed for our key questions: the hour and day of the week from the start time, and the user's age from their birth year. Because the data is spread across several files, we will combine them using common identifiers such as the start station ID and end station ID.

After that, we will compute basic statistics and build visualizations with Seaborn, Matplotlib, and Plotly, including time series plots showing peaks in bike usage and geographical heatmaps of station traffic based on the start and end station names. These steps will help us address the key questions on peak usage times, station demand, and trip duration patterns across user demographics. Finally, the cleaned data will be prepared for machine learning by selecting relevant features for predictive modeling of Bluebikes trip duration.
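
As a rough illustration of the cleaning and feature-extraction steps described above, here is a minimal sketch on a few made-up rows. The column names `starttime`, `stoptime`, and `tripduration` follow those mentioned in the text; the `birth year` column name, the `clean_trips` helper, the `reference_year` parameter, and all sample values are assumptions for illustration, not part of the actual dataset or pipeline.

```python
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration.
raw = pd.DataFrame({
    "tripduration": [600, -5, 1200, None],
    "starttime": ["2024-09-02 08:15:00", "2024-09-02 09:00:00",
                  "2024-09-07 17:45:00", "2024-09-08 12:00:00"],
    "stoptime": ["2024-09-02 08:25:00", "2024-09-02 08:59:55",
                 "2024-09-07 18:05:00", "2024-09-08 12:30:00"],
    "birth year": [1990, 1985, None, 2000],
})


def clean_trips(df, reference_year=2024):
    """Drop unusable trips and derive hour, weekday, and age columns."""
    out = df.copy()
    # Unparseable timestamps become NaT instead of raising an error
    out["starttime"] = pd.to_datetime(out["starttime"], errors="coerce")
    out["stoptime"] = pd.to_datetime(out["stoptime"], errors="coerce")
    # Remove rows with missing timestamps or durations, then non-positive durations
    out = out.dropna(subset=["starttime", "stoptime", "tripduration"])
    out = out[out["tripduration"] > 0].copy()
    out["hour"] = out["starttime"].dt.hour
    out["day_of_week"] = out["starttime"].dt.day_name()
    # Age stays NaN when the birth year is missing
    out["age"] = reference_year - out["birth year"]
    return out.reset_index(drop=True)


trips = clean_trips(raw)
print(trips[["tripduration", "hour", "day_of_week", "age"]])
```

Rows with a missing birth year are kept here so that they can still contribute to the time-of-day analysis; whether to drop them instead is a judgment call that depends on the question being asked.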

## Part 2: 
(2\%) Obtains, cleans, and merges all data sources involved in the project.

import pandas as pd
import requests
import json



indicators = {
    'SP.POP.TOTL': 'Population, total',
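    # NY.GNP.PCAP.CD is GNI (not GDP) per capita; GDP per capita would be NY.GDP.PCAP.CD
    'NY.GNP.PCAP.CD': 'GNI per capita, Atlas method (current US$)',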
    'SI.POV.DDAY': 'Poverty headcount ratio at $3.00 a day (2021 PPP)',
    'SI.POV.GINI': 'Gini index',
    'MS.MIL.XPND.GD.ZS': 'Military expenditure (% of GDP)',
    'VA.EST': 'Voice and Accountability: Estimate'
}

params = {
    'format': 'json',
    'per_page': '300',  # Large enough that all countries are returned on one page
    'date': '2023'  # Query just one year
}


def get_api_url(indicator, params):
    """
    Constructs a URL for the API call, to query a given indicator and with a given set of parameters.

    Args:
        indicator: the indicator ID string
        params: a dictionary containing the API call parameters

    Returns:
        A URL to send an HTTP request to in order to get the API data
    """
    baseurl = 'https://api.worldbank.org/v2/country/all/indicator/'

    # Join key=value pairs with '&' so the URL has no trailing separator
    query = '&'.join(key + '=' + str(value) for key, value in params.items())
    return baseurl + indicator + '?' + query


# Planned use of the region column (to be confirmed by the team): after merging,
# group the countries by region, take the mean of one indicator per region,
# and show the result in a grouped bar plot.
def get_country_data():
    """
    Gets country region and income level as a dataframe indexed by country id.

    Args: 
        None

    Returns:
        DataFrame with country id as index and region/incomeLevel as columns

    """
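    response = requests.get("https://api.worldbank.org/v2/country?format=json&per_page=296")
    response.raise_for_status()  # Fail early on an HTTP error
    # The payload is [pagination metadata, list of country records]
    country_data = response.json()[1]

    country_dct = {}

    # Named `country` rather than `dict` to avoid shadowing the built-in
    for country in country_data:
        # Skip aggregate entries (e.g., regional groupings), which are not countries
        if country["region"]["value"] != "Aggregates":
            country_dct[country["iso2Code"]] = {
                "region": country["region"]["value"],
                "incomeLevel": country["incomeLevel"]["value"],
            }
    df = pd.DataFrame.from_dict(country_dct, orient="index")

    print(df)

    return df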

def get_indicator_data(indicators):
    """
    Gets and cleans the indicator data into a dataframe indexed by country id.

    Args:
        indicators: a dictionary mapping indicator ID strings to descriptions
            (the query itself uses the module-level `params` dictionary)

    Returns:
        Dataframe with country id as index and the indicators as columns

    """


    indicator_data = {}
    # Loop through each indicator and make one API call per indicator,
    # since each request here targets a single indicator endpoint.
    for indicator in indicators.keys():
        response = requests.get(get_api_url(indicator, params))
        response.raise_for_status()  # Fail early on an HTTP error
        indicator_data[indicator] = response.json()

    indicator_series_list = []

    for indicator in indicator_data.keys():
        indicator_dict = {}
        for country in indicator_data[indicator][1]:
            country_id = country['country']['id']
            # Keep only two-letter, uppercase, alphabetic IDs (ISO2-style codes)
            if len(country_id) == 2 and country_id.isalpha() and country_id.isupper():
                indicator_dict[country_id] = country['value']

        indicator_series = pd.Series(indicator_dict)
        indicator_series.name = indicator
        indicator_series_list.append(indicator_series)
    output = pd.DataFrame(indicator_series_list).transpose()
    print(output)
    return output


def merge_data(country_df, indicator_df):
    """
    Merges country dataframe with indicator dataframe based on the country id index.
    
    Args:
        country_df: DataFrame with region/income data
        indicator_df: DataFrame with indicator data
        
    Returns:
        finalized merged dataframe containing country and indicator data
    """
    # how="right" keeps every country in country_df, even those without indicator data
    merge_df = pd.merge(indicator_df, country_df, left_index=True,
                        right_index=True, how="right")
    return merge_df



if __name__ == "__main__":
    country_df = get_country_data()
    indicator_df = get_indicator_data(indicators)
    merged_data = merge_data(country_df, indicator_df)
    print(merged_data)

    merged_data.to_csv('world_bank_data_2023.csv')

                                               region          incomeLevel
AW                         Latin America & Caribbean           High income
AF  Middle East, North Africa, Afghanistan & Pakistan           Low income
AO                                Sub-Saharan Africa   Lower middle income
AL                              Europe & Central Asia  Upper middle income
AD                              Europe & Central Asia          High income
..                                                ...                  ...
XK                              Europe & Central Asia  Upper middle income
YE  Middle East, North Africa, Afghanistan & Pakistan           Low income
ZA                                Sub-Saharan Africa   Upper middle income
ZM                                Sub-Saharan Africa   Lower middle income
ZW                                Sub-Saharan Africa   Lower middle income

[217 rows x 2 columns]
    SP.POP.TOTL  NY.GNP.PCAP.CD  SI.POV.DDAY  SI.POV.GINI  MS.MIL.XPND.GD.ZS
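
To make the merge step and the per-region aggregation mentioned in the code notes more concrete, here is a small sketch on made-up data. The region names, country codes, and indicator values are hypothetical and are not real World Bank figures; only the join arguments mirror `merge_data`.

```python
import pandas as pd

# Hypothetical values for illustration only; not real World Bank figures.
country_df = pd.DataFrame(
    {"region": ["Region A", "Region A", "Region B"],
     "incomeLevel": ["High income", "Low income", "High income"]},
    index=["AA", "BB", "CC"],
)
indicator_df = pd.DataFrame(
    {"SI.POV.GINI": [30.0, 40.0, None, 50.0]},
    index=["AA", "BB", "CC", "ZZ"],  # "ZZ" stands in for an aggregate code
)

# Same join arguments as merge_data: keep every row of country_df and
# drop indicator rows whose index has no matching country.
merged = pd.merge(indicator_df, country_df,
                  left_index=True, right_index=True, how="right")

# Mean per region; missing values are skipped, so a region whose
# countries all lack data ends up with NaN.
region_means = merged.groupby("region")["SI.POV.GINI"].mean()
print(region_means)
```

Because many indicators are not reported for every country in a given year, it may be worth also reporting how many countries contribute to each regional mean, for example with `.count()` on the same grouped column.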

## Part 3:
(2\%) Builds at least two visualizations (graphs/plots) from the data which help to understand or answer the questions of interest. These visualizations will be graded based on how much information they can effectively communicate to readers. Please make sure your visualization are sufficiently distinct from each other.