# Project 1 (Due Nov 13)

The goal of the first project is to non-parametrically model some phenomenon of interest, and generate sequences of values. There are six options below:

- Chordonomicon: 680,000 chord progressions of popular music songs. Create a chord generator, similar to what we did with Bach in class, but for a particular artist or genre. (https://github.com/spyroskantarelis/chordonomicon)
- Financial Time series, S&P500 Stocks: There are 500 time series here. Model how individual time series adjust over time, either together or separately. (https://www.kaggle.com/datasets/andrewmvd/sp-500-stocks)
- MIT-BIH Arrhythmia Database: Arrhythmia is an abnormal heart rhythm. This is a classic dataset that contains a day of ECG time series measurements for 4,000 patients. (https://www.physionet.org/content/mitdb/1.0.0/)
- Ukraine Conflict Monitor: The ACLED Ukraine Conflict Monitor, maintained by the Armed Conflict Location & Event Data Project, provides near real-time information on the ongoing war in Ukraine, including an interactive map, a curated data file, and weekly situation updates. Coverage starts in 2022 and includes battles, explosions/remote violence, violence against civilians, protests, and riots. (https://acleddata.com/monitor/ukraine-conflict-monitor)
- SIPRI Arms Trade: The SIPRI Arms Transfers Database is a comprehensive public resource tracking all international transfers of major conventional arms from 1950 to the present. For each deal, information includes: number ordered, supplier/recipient identities, weapon types, delivery dates, and deal comments. The database can address questions about: who are suppliers and recipients of major weapons, what weapons have been transferred by specific countries, and how supplier-recipient relationships have changed over time. https://www.sipri.org/databases/armstransfers
- Environmental Protection Agency data: The EPA, in general, has excellent data on the release of toxic substances, and I also tracked down air quality and asthma. You can put these together to look at how changes in toxic release correlate with air quality and respiratory disease over time: https://www.epa.gov/data https://www.epa.gov/toxics-release-inventory-tri-program/tri-toolbox https://www.cdc.gov/asthma/most_recent_national_asthma_data.htm https://www.earthdata.nasa.gov/topics/atmosphere/air-quality/data-access-tools

If you have other data sources that you're interested in, I am willing to consider them, as long as they lend themselves to an interesting analysis.

Submit a document or notebook that clearly addresses the following:

1. Describe the data clearly -- particularly any missing data that might impact your analysis -- and the provenance of your dataset. Who collected the data and why? (10/100 pts)
2. What phenomenon are you modeling? Provide a brief background on the topic, including definitions and details that are relevant to your analysis. Clearly describe its main features, and support those claims with data where appropriate. (10/100 pts)
3. Describe your non-parametric model (empirical cumulative distribution functions, kernel density function, local constant least squares regression, Markov transition models). How are you fitting your model to the phenomenon to get realistic properties of the data? What challenges did you have to overcome? (15/100 pts)
4. Either use your model to create new sequences (if the model is more generative) or bootstrap a quantity of interest (if the model is more inferential). (15/100 pts)
5. Critically evaluate your work in part 4. Do your sequences have the properties of the training data, and if not, why not? Are your estimates credible and reliable, or is there substantial uncertainty in your results? (15/100 pts)
6. Write a conclusion that explains the limitations of your analysis and potential for future work on this topic. (10/100 pts)
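
As one illustration of the Markov transition option in item 3 and the sequence generation in item 4, here is a minimal sketch. It assumes the data has already been encoded as integer states (for example chord IDs or binned returns). The function names `fit_transition_matrix` and `generate_sequence`, the toy `observed` sequence, and the uniform fallback for states with no observed transitions are all assumptions made for illustration, not a required approach.

```python
import numpy as np

def fit_transition_matrix(states, n_states):
    """Estimate first-order transition probabilities by counting observed transitions."""
    counts = np.zeros((n_states, n_states))
    for current, nxt in zip(states[:-1], states[1:]):
        counts[current, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Assumption: a state with no observed outgoing transitions gets a uniform row.
    safe_sums = np.where(row_sums == 0, 1, row_sums)
    return np.where(row_sums > 0, counts / safe_sums, 1.0 / n_states)

def generate_sequence(probs, start, length, rng):
    """Sample a new sequence by repeatedly drawing the next state from the current row."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(rng.choice(len(probs), p=probs[seq[-1]]))
    return np.array(seq)

rng = np.random.default_rng(0)
observed = np.array([0, 1, 2, 1, 0, 1, 2, 2, 1, 0, 0, 1])  # toy data, not from any dataset
P = fit_transition_matrix(observed, n_states=3)
new_sequence = generate_sequence(P, start=observed[0], length=20, rng=rng)
```

Comparing transition frequencies in `new_sequence` against those in `observed` is one simple way to start the evaluation asked for in item 5.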

In addition, submit a GitHub repo containing your code and a description of how to obtain the original data from the source. Make sure the code is commented, where appropriate. Include a .gitignore file. We will look at your commit history briefly to determine whether everyone in the group contributed. (10/100 pts)

In class, each group will give a brief presentation and critique the other groups' work. Participating in your group's presentation and constructively critiquing the other groups' presentations accounts for the remaining 15/100 pts.

In [1]:
import pandas as pd
import numpy as np

In [3]:
companies = pd.read_csv('sp500_companies.csv')
index = pd.read_csv('sp500_index.csv')
stocks = pd.read_csv('sp500_stocks.csv')

In [5]:
companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Exchange             502 non-null    object 
 1   Symbol               502 non-null    object 
 2   Shortname            502 non-null    object 
 3   Longname             502 non-null    object 
 4   Sector               502 non-null    object 
 5   Industry             502 non-null    object 
 6   Currentprice         502 non-null    float64
 7   Marketcap            502 non-null    int64  
 8   Ebitda               473 non-null    float64
 9   Revenuegrowth        499 non-null    float64
 10  City                 502 non-null    object 
 11  State                482 non-null    object 
 12  Country              502 non-null    object 
 13  Fulltimeemployees    493 non-null    float64
 14  Longbusinesssummary  502 non-null    object 
 15  Weight               502 non-null    flo

In [6]:
index.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2517 entries, 0 to 2516
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    2517 non-null   object 
 1   S&P500  2517 non-null   float64
dtypes: float64(1), object(1)
memory usage: 39.5+ KB


In [12]:
stocks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1891536 entries, 0 to 1891535
Data columns (total 8 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Date       object 
 1   Symbol     object 
 2   Adj Close  float64
 3   Close      float64
 4   High       float64
 5   Low        float64
 6   Open       float64
 7   Volume     float64
dtypes: float64(6), object(2)
memory usage: 115.5+ MB


In [11]:
stocks.isnull().sum()


Date               0
Symbol             0
Adj Close    1273705
Close        1273705
High         1273705
Low          1273705
Open         1273705
Volume       1273705
dtype: int64

The data consists of three related datasets detailing the S&P 500 and its constituent companies. 

The *companies* dataset contains information on 502 companies that are part of the S&P 500 index. It includes details like stock symbol, short and long company names, sector and industry, current price, market capitalization, EBITDA, revenue growth, city, state, country, full-time employees, a business summary, and the company's weight in the index. Some columns have missing data:
- *Ebitda* has 29 missing values
- *Revenuegrowth* has 3 missing values
- *State* has 20 missing values
- *Fulltimeemployees* has 9 missing values

These gaps matter mainly for analyses that rely on EBITDA, revenue growth, company location, or headcount. The price, market capitalization, and index weight columns are complete.

The *index* dataset contains historical daily values of the S&P 500 index over 2517 trading days. It has two columns: Date, which is stored as text rather than as a date type, and S&P500, the index's closing value. Neither column has missing values.
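
Before modeling how the index changes over time, the Date column has to be converted from text, and levels are usually turned into returns. The following is a minimal sketch on a made-up four-row frame. The `toy_index` values and the ISO date format are assumptions, and log returns are just one common choice, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sp500_index.csv (values invented for illustration).
toy_index = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"],
    "S&P500": [100.0, 110.0, 99.0, 99.0],
})

# Convert the text dates and use them as a sorted index.
toy_index["Date"] = pd.to_datetime(toy_index["Date"])
toy_index = toy_index.sort_values("Date").set_index("Date")

# Daily log returns: log(P_t) - log(P_{t-1}); the first row has no predecessor.
log_returns = np.log(toy_index["S&P500"]).diff().dropna()
```

The same steps should apply to the real `index` frame loaded earlier, provided its dates parse cleanly.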

The *stocks* dataset provides daily stock trading data for all S&P 500 companies, including Date, Symbol, Adj Close, Close, High, Low, Open, and Volume. It contains 1,891,536 rows. Date and Symbol are complete, but each of Adj Close, Close, High, Low, Open, and Volume has 1,273,705 null entries, about 67% of the rows. The identical count across all six columns suggests that the same rows are missing every trading field, although this is not checked above. Any analysis that needs complete price or volume data must first filter out or otherwise handle these rows.
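
To see where the nulls in the trading columns are concentrated, the missing share can be broken down by symbol before dropping incomplete rows. The following is a minimal sketch on a made-up frame. The `toy` data and the symbols `AAA` and `BBB` are hypothetical, and dropping rows is only one possible way to handle the gaps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the long-format stocks table (values invented for illustration).
toy = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-02", "2020-01-03"],
    "Symbol": ["AAA", "AAA", "BBB", "BBB"],
    "Close": [10.0, 10.5, np.nan, np.nan],
    "Volume": [1000.0, 1200.0, np.nan, np.nan],
})

# Share of missing closing prices overall and per symbol.
overall_missing = toy["Close"].isna().mean()
missing_by_symbol = toy["Close"].isna().groupby(toy["Symbol"]).mean()

# Keep only rows with a recorded closing price before modeling.
clean = toy.dropna(subset=["Close"])
```

On the real `stocks` frame, a breakdown like `missing_by_symbol` would show whether the gaps come from a subset of symbols or are spread across all of them.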

The S&P 500 index is a widely recognized financial benchmark tracking the performance of large U.S. companies. The dataset likely comes from publicly available financial databases (such as Yahoo Finance) and was collected to enable analysis of stock performance, trading behavior, and the relationship between individual stocks and overall market trends. Analysts or researchers can use it to study financial patterns, market volatility, or trends in company performance over time.
