# INFO 2950: Phase 2 
#### Group Members: Anusha Bishayee, Katheryn Ding

---

### __Research Question:__  

#### How do ESG score and stock performance (price) align across different industries? What associations can we find between company industry, stock performance, and ESG ratings?
#### note: ESG score refers to a quantitative metric measuring a company's environmental, social, and governance performance; 'environmental' pertains to aspects like waste management and energy emissions, 'social' pertains to aspects like customer satisfaction and DEI in the workplace, and 'governance' pertains to aspects like operating efficiencies and risk management. ESG scores are typically examined by investors, analysts, and competitors to assess risks or opportunities associated with a specific company's practices.



---

### __Data Collection and Cleaning:__

In [1]:
import contextlib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
import sys
import warnings
import yfinance as yf

our original dataset with ESG information for different large/mid-cap companies came in csv format, which we downloaded from Kaggle. it has 722 rows, each corresponding to a unique publicly traded company. further description of the columns can be found in the 'Data Description' portion of this notebook.

we first dropped all rows that had null values, which eliminated 27 companies. we then filtered this dataset for just USD currency, excluding companies that are traded in CNY or any other currency. this allows us to have greater familiarity with the industries and companies we analyze - this process eliminated 15 more of our rows, and left us with 680 companies.

In [2]:
esg = pd.read_csv("esg_data.csv")
print(f"original data shape: {esg.shape}")

esg = esg.dropna()
print(f"non-null data shape: {esg.shape}")

esg = esg[esg["currency"] == "USD"]
print(f"refined data shape: {esg.shape}")

original data shape: (722, 21)
non-null data shape: (695, 21)
refined data shape: (680, 21)
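
the row counts above can also be tracked programmatically. below is a minimal sketch on a tiny made-up frame (the `toy` data and the `log_step` helper are hypothetical and not part of our notebook) that records how many rows each cleaning step removes.

```python
import pandas as pd

# toy stand-in for esg_data.csv (made-up values, for illustration only)
toy = pd.DataFrame({
    "ticker": ["aaa", "bbb", "ccc", "ddd", "eee"],
    "currency": ["USD", "USD", "CNY", None, "USD"],
    "total_score": [900, None, 850, 700, 1100],
})

def log_step(df, label, log):
    """Append (label, row count) to log and return df unchanged."""
    log.append((label, len(df)))
    return df

log = []
step0 = log_step(toy, "original", log)
step1 = log_step(step0.dropna(), "non-null", log)
step2 = log_step(step1[step1["currency"] == "USD"], "USD only", log)

# rows removed at each step = difference between consecutive counts
removed = {log[i][0]: log[i - 1][1] - log[i][1] for i in range(1, len(log))}
print(log)
print(removed)
```

keeping the counts in one place makes it easier to check statements like "eliminated 27 companies" against the data if the csv ever changes.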


then, we converted the 'last_processing_date' column in our dataset to DateTime format (displayed as m-d-y), and sorted the dataset by descending and then ascending 'last_processing_date' to see the range of processing dates in the data.

In [3]:
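esg["last_processing_date"] = pd.to_datetime(esg["last_processing_date"], format = "mixed")
# note: after the next line the column holds m-d-Y strings, so the sorts below are string sorts;
# they match chronological order only if every date falls in the same year
esg["last_processing_date"] = esg["last_processing_date"].dt.strftime('%m-%d-%Y')

esg = esg.sort_values(by = "last_processing_date", ascending = False)
# single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"latest dates:\n{esg['last_processing_date'].head(2)}")

esg = esg.sort_values(by = "last_processing_date", ascending = True)
print(f"\nearliest dates:\n{esg['last_processing_date'].head(2)}")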

latest dates:
720    11-15-2022
716    11-15-2022
Name: last_processing_date, dtype: object

earliest dates:
658    02-08-2022
36     04-16-2022
Name: last_processing_date, dtype: object
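
one caveat about sorting dates that are stored as m-d-Y text: the comparison starts with the month, so the order is only chronological when all dates share a year. a minimal sketch with made-up dates (not values from our dataset) shows the difference between sorting the text and sorting the datetime values first:

```python
import pandas as pd

# made-up dates spanning two years, for illustration only
dates = pd.Series(pd.to_datetime(["2022-11-15", "2023-01-05", "2022-02-08"]))

# sorting formatted strings compares the month first, so 01-05-2023 sorts earliest
as_text = dates.dt.strftime("%m-%d-%Y")
text_sorted = as_text.sort_values().tolist()

# sorting the datetime values first, then formatting, keeps chronological order
chrono_sorted = dates.sort_values().dt.strftime("%m-%d-%Y").tolist()

print(text_sorted)
print(chrono_sorted)
```

if the data ever includes more than one year, sorting before formatting (or formatting only when printing) would be the safer order of operations.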


now, we want to add finance data from the yfinance library to our esg dataset. we used the ticker column to match companies between yfinance and our esg dataset, and we set the date range of the finance data to 1/1/21 through 12/31/22, as all of the 'last processing date' values for the esg data range from 2/8/22 to 11/15/22. specifically, for each company we calculated the stock percentage change over this period, the volatility (standard deviation of daily returns), the 50-day moving average on the last trading day, and the cumulative return.

In [4]:
# prevents some annoying yfinance outputs from printing - originally, the output took up like 23 pages
@contextlib.contextmanager
def suppress_output():
    with open(os.devnull, 'w') as devnull:
        old_stdout = sys.stdout
        old_stderr = sys.stderr
        sys.stdout = devnull
        sys.stderr = devnull
        try:
            yield
        finally:
            sys.stdout = old_stdout
            sys.stderr = old_stderr

tickers = esg["ticker"].tolist()
stock_data = []

for ticker in tickers:
    try:
        # suppresses all of the outputs when grabbing data from yfinance
        with suppress_output():  
            stock = yf.download(ticker, start = "2021-01-01", end = "2022-12-31", progress = False)
        
        if not stock.empty:
            # calculating percentage change
            percentage_change = ((stock["Close"].iloc[-1] - stock["Close"].iloc[0]) / stock["Close"].iloc[0]) * 100
            
            # calculating volatility (sd of daily returns)
            daily_returns = stock["Close"].pct_change()
            volatility = daily_returns.std()
            
            # calculating 50-day moving average
            stock["50_day_SMA"] = stock["Close"].rolling(window=50).mean()
            sma_50_day = stock["50_day_SMA"].iloc[-1]
            
            # calculating cumulative return
            cumulative_return = (stock["Close"].iloc[-1] / stock["Close"].iloc[0]) - 1
            
            stock_data.append({
                'ticker': ticker, 
                'percentage_change': percentage_change,
                'volatility': volatility,
                '50_day_SMA': sma_50_day,
                'cumulative_return': cumulative_return
            })
            
    # also helps to suppress annoying outputs
    except (yf.YFTzMissingError, yf.YFPricesMissingError):
        pass 
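
to make the four metrics easier to sanity-check, here is a minimal sketch that applies the same formulas to a made-up price series (the prices are synthetic and not pulled from yfinance). it also shows that percentage_change and cumulative_return carry the same information, just on different scales.

```python
import numpy as np
import pandas as pd

# made-up closing prices on 60 business days (not real market data)
idx = pd.bdate_range("2021-01-04", periods=60)
close = pd.Series(np.linspace(100.0, 159.0, 60), index=idx)

# total percentage change from the first to the last close
percentage_change = (close.iloc[-1] - close.iloc[0]) / close.iloc[0] * 100

# volatility as the standard deviation of daily returns
daily_returns = close.pct_change()
volatility = daily_returns.std()

# 50-day simple moving average; the first 49 values are NaN by construction
sma_50 = close.rolling(window=50).mean()

# cumulative return is percentage_change expressed as a fraction
cumulative_return = close.iloc[-1] / close.iloc[0] - 1

print(percentage_change, cumulative_return, sma_50.iloc[-1])
```

since the two return columns differ only by a factor of 100, any later model or correlation should probably use just one of them.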

now, we need to convert the stock data we extracted from yfinance to a dataframe to merge it with our original esg dataframe.

In [9]:
stock_df = pd.DataFrame(stock_data)
merged_df = esg.merge(stock_df, on = 'ticker', how = 'left')
print(f"current data shape: {merged_df.shape}")

print(f"\n{merged_df.head()}")

current data shape: (680, 25)

  ticker                           name currency  \
0   poww                       Ammo Inc      USD   
1   acls       Axcelis Technologies Inc      USD   
2   achc  Acadia Healthcare Company Inc      USD   
3     cf     CF Industries Holdings Inc      USD   
4      t                       AT&T Inc      USD   

                        exchange           industry  \
0     NASDAQ NMS - GLOBAL MARKET   Leisure Products   
1     NASDAQ NMS - GLOBAL MARKET     Semiconductors   
2     NASDAQ NMS - GLOBAL MARKET        Health Care   
3  NEW YORK STOCK EXCHANGE, INC.          Chemicals   
4  NEW YORK STOCK EXCHANGE, INC.  Telecommunication   

                                                logo  \
0  https://static.finnhub.io/logo/8decc6ca0564a89...   
1  https://static.finnhub.io/logo/88b5f730-80df-1...   
2  https://static.finnhub.io/logo/4b6b2e5a4cfce5b...   
3  https://static.finnhub.io/logo/9b57a636-80eb-1...   
4  https://static.finnhub.io/logo/7d20269e-80

now, we need to drop any rows where the finance data has left null values. this eliminates 60 more of our rows, leaving us with 620 companies.

In [10]:
merged_df = merged_df.dropna()
print(f"non-null finance data shape: {merged_df.shape}")

non-null finance data shape: (620, 25)
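
to see which companies lose their rows at this step, one option is the `indicator` argument of `merge`. the sketch below uses made-up toy frames (`esg_toy` and `stock_toy` are hypothetical stand-ins for esg and stock_df) and lists the tickers that had no match in the finance data.

```python
import pandas as pd

# toy stand-ins for esg and stock_df (made-up values)
esg_toy = pd.DataFrame({"ticker": ["aaa", "bbb", "ccc"], "total_score": [900, 850, 1100]})
stock_toy = pd.DataFrame({"ticker": ["aaa", "ccc"], "volatility": [0.02, 0.03]})

# indicator=True adds a _merge column saying where each row was found
merged_toy = esg_toy.merge(stock_toy, on="ticker", how="left", indicator=True)

# 'left_only' rows are companies with no finance data at all
unmatched = merged_toy.loc[merged_toy["_merge"] == "left_only", "ticker"].tolist()
print(unmatched)
```

note that dropna() can also remove rows that matched but have a NaN in some other column, so this list is a lower bound on what gets dropped.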


this is still a lot of data, but will be helpful for getting industry-level and other general overviews of the data. merged_df will be our main dataset.

we also want to create a sample of these 620 companies so that we are able to look at trends and associations at the individual company level as well. to create our sample, we went through the industry list and manually selected 23 well-known companies from differing industries; we will probably refine this approach later (this ties into one of our questions for reviewers). sample_companies will be our 2nd dataset.

In [12]:
companies = ["Walt Disney Co", "American Airlines Group Inc", "Apple Inc", "eBay Inc", "Goldman Sachs Group Inc", 
             "Meta Platforms Inc", "Starbucks Corp", "PayPal Holdings Inc", "United Airlines Holdings Inc", 
             "Bath & Body Works Inc", "Abbvie Inc", "Alexandria Real Estate Equities Inc", 
             "Becton Dickinson and Co", "Brown & Brown Inc", "Duke Energy Corp", "T-Mobile US Inc",
             "Marathon Oil Corp", "Chipotle Mexican Grill Inc", "Target Corp", 
             "General Motors Co", "Salesforce Inc", "Tesla Inc", "Bank of America Corp"]

sample_companies = merged_df[merged_df["name"].isin(companies)]
print(sample_companies)

    ticker                                 name currency  \
6      aal          American Airlines Group Inc      USD   
10    aapl                            Apple Inc      USD   
86     cmg           Chipotle Mexican Grill Inc      USD   
121    bro                    Brown & Brown Inc      USD   
130    bdx              Becton Dickinson and Co      USD   
133    bac                 Bank of America Corp      USD   
147    are  Alexandria Real Estate Equities Inc      USD   
161   abbv                           Abbvie Inc      USD   
189   ebay                             eBay Inc      USD   
231     gs              Goldman Sachs Group Inc      USD   
235     gm                    General Motors Co      USD   
273    duk                     Duke Energy Corp      USD   
362   pypl                  PayPal Holdings Inc      USD   
428    ual         United Airlines Holdings Inc      USD   
429   tmus                      T-Mobile US Inc      USD   
430   tsla                            Te

finally, we also extracted general S&P 500 data from yfinance over the same 1/1/21 to 12/31/22 range. we are pulling this data so that we can compare the stock performance of individual companies to the overall S&P 500 in the same time range. sp500 will be our 3rd dataset.

In [14]:
sp500data = yf.download('^GSPC', start = '2021-01-01', end = '2022-12-31', progress = False)
sp500 = pd.DataFrame({
    'Date': sp500data.index,
    'Start Price': sp500data['Open'],
    'End Price': sp500data['Close'],
    'Rate of Change': ((sp500data['Close'] - sp500data['Open']) / sp500data['Open']) * 100 })

sp500.set_index('Date', inplace = True)
print(sp500)

            Start Price    End Price  Rate of Change
Date                                                
2021-01-04  3764.610107  3700.649902       -1.698986
2021-01-05  3698.020020  3726.860107        0.779879
2021-01-06  3712.199951  3748.139893        0.968157
2021-01-07  3764.709961  3803.790039        1.038063
2021-01-08  3815.050049  3824.679932        0.252418
...                 ...          ...             ...
2022-12-23  3815.110107  3844.820068        0.778745
2022-12-27  3843.340088  3829.250000       -0.366610
2022-12-28  3829.560059  3783.219971       -1.210063
2022-12-29  3805.449951  3849.280029        1.151771
2022-12-30  3829.060059  3839.500000        0.272650

[503 rows x 3 columns]
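
the 'Rate of Change' column above is a daily open-to-close change, so it is not directly comparable to the per-company percentage_change, which covers the whole period. a minimal sketch of one way to line them up, using made-up numbers (the index closes and company returns below are not real data):

```python
import pandas as pd

# made-up index closes and company returns (not real market data)
index_close = pd.Series([3700.0, 3800.0, 3885.0])
company_pct = pd.Series({"aaa": 12.0, "bbb": -3.0})

# period return of the benchmark, computed the same way as percentage_change
index_pct = (index_close.iloc[-1] - index_close.iloc[0]) / index_close.iloc[0] * 100

# excess return: how far each company beat (or trailed) the benchmark, in percentage points
excess = company_pct - index_pct
print(index_pct)
print(excess)
```

with the real data, `index_close` would presumably be the 'End Price' column of sp500 and `company_pct` the percentage_change column of merged_df.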


---

### __Exploratory Data Analysis__

#### part one - exploring different average environmental, social, governance, and total ESG scores by industry

In [None]:
avg_esg_by_industry = merged_df.groupby('industry')['total_score'].mean().reset_index()
avg_esg_by_industry.columns = ['Industry', 'Average Total ESG Score']

avg_esg_by_industry = avg_esg_by_industry.sort_values(by = 'Average Total ESG Score', ascending = False)
print("best average ESG scores")
print(avg_esg_by_industry.head(6))
print("")

avg_esg_by_industry = avg_esg_by_industry.sort_values(by = 'Average Total ESG Score', ascending = True)
print("worst average ESG scores")
print(avg_esg_by_industry.head(6))

In [None]:
plt.figure(figsize = (14, 10))
sns.barplot(x = 'Average Total ESG Score', y = 'Industry', data = avg_esg_by_industry, color = "#b97df3")
plt.title('Average Total ESG Score by Industry', horizontalalignment = 'center', fontsize = 16, fontweight = 'bold')
plt.xlabel('Average Total ESG Score', fontsize = 14, fontweight = 'bold')
plt.ylabel('Industry Name', fontsize = 14, fontweight = 'bold')
plt.show()

interestingly (and somewhat predictably), the industries with the lowest average ESG scores are Metals & Mining; Aerospace & Defense; Diversified Consumer Services; Hotels, Restaurants & Leisure; Leisure Products; Auto Components; Airlines; and Automobiles. the industries with the highest average ESG scores are Utilities, Tobacco, Industrial Conglomerates, Packaging, and Energy.

future steps: sort different industries by just Environmental score, just Social score, and just Governance score to see if these differ significantly.
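
a minimal sketch of this future step, run on a made-up frame (the `toy` values are invented; only the three score column names come from our dataset). it averages each subscore by industry and ranks the industries within each pillar.

```python
import pandas as pd

# made-up scores for three industries, for illustration only
toy = pd.DataFrame({
    "industry": ["Utilities", "Utilities", "Airlines", "Airlines", "Media", "Media"],
    "environment_score": [500, 540, 300, 320, 400, 420],
    "social_score": [300, 320, 310, 330, 350, 370],
    "governance_score": [280, 300, 250, 270, 260, 240],
})

score_cols = ["environment_score", "social_score", "governance_score"]
avg_by_industry = toy.groupby("industry")[score_cols].mean()

# rank 1 = highest average within each pillar
ranks = avg_by_industry.rank(ascending=False).astype(int)

# industry with the highest average for each pillar
top_industry = avg_by_industry.idxmax()

print(ranks)
print(top_industry)
```

comparing the rank columns side by side would show whether an industry that does well on one pillar also does well on the others.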

In [None]:
# Select relevant columns for correlation
score_columns = ['environment_score', 'social_score', 'governance_score', 'total_score']

# Compute the correlation matrix
corr_matrix = merged_df[score_columns].corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create a heatmap to visualize the correlation matrix
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={"shrink": .8}, linewidths=0.5)

# Set titles and labels
plt.title('Correlation Matrix of ESG Scores', fontsize=16)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()



mess

In [None]:
#use yfinance to pull stock information of selected stocks.
esg.loc[:, 'ticker'] = esg['ticker'].astype(str)
esg.loc[:, 'name'] = esg['name'].astype(str)
tickers = esg['ticker'].tolist()

#Add Start Price, End Price, and Rate of Change (%) of each company to the dataset esg
esg.loc[:, 'Start Price'] = None
esg.loc[:, 'End Price'] = None
esg.loc[:, 'Rate of Change (%)'] = None

# Loop through each row of the DataFrame to get stock information for each company
for index, row in esg.iterrows():
    ticker = row['ticker']
    
    # Download stock data from 4/1/21 to 4/1/22
    data = yf.download(ticker, start='2021-04-01',end='2022-04-01')
    
    # Ensure data exists for the given period
    if not data.empty:
        start_price = data['Adj Close'].iloc[0]
        end_price = data['Adj Close'].iloc[-1]
        rate_of_change = ((end_price - start_price) / start_price) * 100
        
        # Add the stock data to the relevant columns in the DataFrame using .loc[]
        esg.loc[index, 'Start Price'] = start_price
        esg.loc[index, 'End Price'] = end_price
        esg.loc[index, 'Rate of Change (%)'] = rate_of_change

print(esg)

In [None]:
#cleaning to only have certain companies that represent a variety of industries

companies = ["Walt Disney Co", "American Airlines Group Inc", "Apple Inc", "eBay Inc", "Goldman Sachs Group Inc", 
             "Meta Platforms Inc", "Starbucks Corp", "PayPal Holdings Inc", "United Airlines Holdings Inc", 
             "Bath & Body Works Inc", "Abbvie Inc", "Alexandria Real Estate Equities Inc", 
             "Becton Dickinson and Co", "Brown & Brown Inc", "Duke Energy Corp", "T-Mobile US Inc",
             "Marathon Oil Corp", "Chipotle Mexican Grill Inc", "Target Corp", 
             "General Motors Co", "Salesforce Inc", "Tesla Inc", "Bank of America Corp"]

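# .copy() makes relevant_esg an independent DataFrame, so the .loc assignments in the next cell
# do not trigger pandas' SettingWithCopyWarning
relevant_esg = esg[esg["name"].isin(companies)].copy()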
print(relevant_esg)

In [None]:
#use yfinance to pull stock information of selected stocks.
relevant_esg.loc[:, 'ticker'] = relevant_esg['ticker'].astype(str)
relevant_esg.loc[:, 'name'] = relevant_esg['name'].astype(str)
tickers = relevant_esg['ticker'].tolist()
#Add Start Price, End Price, and Rate of Change (%) of each company to the dataset relevant_esg
relevant_esg.loc[:, 'Start Price'] = None
relevant_esg.loc[:, 'End Price'] = None
relevant_esg.loc[:, 'Rate of Change (%)'] = None

# Loop through each row of the DataFrame to get stock information for each company
for index, row in relevant_esg.iterrows():
    ticker = row['ticker']
    
    # Download stock data from 4/1/21 to 4/1/22
    data = yf.download(ticker, start='2021-04-01',end='2022-04-01')
    
    # Ensure data exists for the given period
    if not data.empty:
        start_price = data['Adj Close'].iloc[0]
        end_price = data['Adj Close'].iloc[-1]
        rate_of_change = ((end_price - start_price) / start_price) * 100
        
        # Add the stock data to the relevant columns in the DataFrame using .loc[]
        relevant_esg.loc[index, 'Start Price'] = start_price
        relevant_esg.loc[index, 'End Price'] = end_price
        relevant_esg.loc[index, 'Rate of Change (%)'] = rate_of_change

print(relevant_esg)


In [None]:
#Add Start Price, End Price, and Rate of Change (%) of each company to the dataset esg
esg.loc[:, 'ticker'] = esg['ticker'].astype(str)
esg.loc[:, 'name'] = esg['name'].astype(str)
tickers = esg['ticker'].tolist()

# Add columns for Start Price, End Price, and Rate of Change (%)
esg['Start Price'] = None
esg['End Price'] = None
esg['Rate of Change (%)'] = None

# Loop through each row of the DataFrame to get stock information for each company
for index, row in esg.iterrows():
    ticker = row['ticker']
    
    # Skip rows where the ticker is not valid
    if ticker == 'nan' or ticker.strip() == "":
        continue
    
    # Download stock data for the given ticker
    data = yf.download(ticker, start='2021-04-01', end='2022-04-01')
    
    # Ensure data exists for the given period
    if not data.empty:
        start_price = data['Adj Close'].iloc[0]
        end_price = data['Adj Close'].iloc[-1]
        rate_of_change = ((end_price - start_price) / start_price) * 100
        
        # Add the stock data to the relevant columns in the DataFrame
        esg.loc[index, 'Start Price'] = start_price
        esg.loc[index, 'End Price'] = end_price
        esg.loc[index, 'Rate of Change (%)'] = rate_of_change

#Remove companies with missing value on stock information:
esg.dropna(subset=['Start Price', 'End Price', 'Rate of Change (%)'], inplace=True)

#show the first 15 rows of cleaned esg
print(esg.iloc[0:15,:])

#### part two

In [None]:
#Violin Plot: Governance Score by Environment Level
plt.figure(figsize=(12, 6))
sns.violinplot(data=esg, x='environment_level', y='governance_score', palette='muted')
plt.title('Governance Scores by Environment Level', fontsize=16)
plt.xlabel('Environment Level')
plt.ylabel('Governance Score')
plt.show()

In [None]:
# Plot the distribution of total_score
plt.figure(figsize=(10, 6))
sns.histplot(esg['total_score'], kde=True, bins=20)
plt.title('Distribution of Total ESG Scores', fontsize=16)
plt.xlabel('Total Score')
plt.ylabel('Frequency')
plt.show()

In [None]:
#Boxplot of Total Scores by Total Grade
plt.figure(figsize=(12, 6))
sns.boxplot(data=esg, x='total_grade', y='total_score', palette='Set2')
plt.title('Total ESG Scores by Grade', fontsize=16)
plt.xlabel('Total Grade')
plt.ylabel('Total Score')
plt.show()

In [None]:
#Scatter Plot: Rate of Change vs. Total ESG Score
plt.figure(figsize=(10, 6))
sns.scatterplot(data=esg, x='total_score', y='Rate of Change (%)', hue='total_grade', palette='coolwarm', s=100)
plt.title('Rate of Change (%) vs. Total ESG Score', fontsize=16)
plt.xlabel('Total ESG Score')
plt.ylabel('Rate of Change (%)')
plt.legend(title="ESG Grade", loc='upper right')
plt.show()

In [None]:
#Boxplot: Rate of Change by Environment Grade
plt.figure(figsize=(12, 6))
sns.boxplot(data=esg, x='environment_grade', y='Rate of Change (%)', palette='Set2')
plt.title('Rate of Change (%) by Environmental Grade', fontsize=16)
plt.xlabel('Environment Grade')
plt.ylabel('Rate of Change (%)')
plt.show()


In [None]:
#Violin Plot: Rate of Change by Social Grade
plt.figure(figsize=(12, 6))
sns.violinplot(data=esg, x='social_grade', y='Rate of Change (%)', palette='muted')
plt.title('Rate of Change (%) by Social Grade', fontsize=16)
plt.xlabel('Social Grade')
plt.ylabel('Rate of Change (%)')
plt.show()


In [None]:
correlation_matrix=esg[['environment_score', 'social_score', 'governance_score', 'Rate of Change (%)']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Between ESG Subscores and Stock Price Change')
plt.xticks(rotation=45)

plt.show()

correlation_environment = esg['environment_score'].corr(esg['Rate of Change (%)'])
correlation_social = esg['social_score'].corr(esg['Rate of Change (%)'])
correlation_governance = esg['governance_score'].corr(esg['Rate of Change (%)'])

print(f'Correlation between Environmental Score and Rate of Change (%): {correlation_environment:.3f}')
print(f'Correlation between Social Score and Rate of Change (%): {correlation_social:.3f}')
print(f'Correlation between Governance Score and Rate of Change (%): {correlation_governance:.3f}')
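
The correlations printed above do not say whether they are distinguishable from zero. A minimal sketch of one way to check, using `stats.pearsonr` from scipy (already imported at the top of the notebook) on made-up numbers; the arrays below are invented and the 0.05 threshold is just a common convention, not something we have settled on.

```python
import numpy as np
from scipy import stats

# made-up data: one strongly related pair and one weakly related pair
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_strong = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
y_weak = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 2.0])

# pearsonr returns the correlation coefficient and a two-sided p-value
r_strong, p_strong = stats.pearsonr(x, y_strong)
r_weak, p_weak = stats.pearsonr(x, y_weak)

print(f"strong pair: r = {r_strong:.3f}, p = {p_strong:.4f}")
print(f"weak pair:   r = {r_weak:.3f}, p = {p_weak:.4f}")
```

For the real data, the same call could be applied to each ESG subscore and 'Rate of Change (%)' after converting that column to a numeric dtype.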


In [None]:
# Plot histograms for each ESG subscore
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(esg['environment_score'].dropna(), bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Environmental Score')
plt.ylabel('Frequency')
plt.title('Distribution of Environmental Scores')

plt.subplot(1, 3, 2)
plt.hist(esg['social_score'].dropna(), bins=20, color='lightgreen', edgecolor='black')
plt.xlabel('Social Score')
plt.ylabel('Frequency')
plt.title('Distribution of Social Scores')

plt.subplot(1, 3, 3)
plt.hist(esg['governance_score'].dropna(), bins=20, color='lightcoral', edgecolor='black')
plt.xlabel('Governance Score')
plt.ylabel('Frequency')
plt.title('Distribution of Governance Scores')

plt.tight_layout()
plt.show()


In [None]:
env_scores = relevant_esg['environment_score'].values
soc_scores = relevant_esg['social_score'].values
gov_scores = relevant_esg['governance_score'].values
filtered_companies = relevant_esg['name'].values

x = np.arange(len(filtered_companies))

plt.figure(figsize=(12, 6))
plt.bar(x, env_scores, color='skyblue', label='Environmental Score')
plt.bar(x, soc_scores, bottom=env_scores, color='lightgreen', label='Social Score')
plt.bar(x, gov_scores, bottom=env_scores + soc_scores, color='lightcoral', label='Governance Score')

plt.xticks(x, filtered_companies, rotation=45, ha='right')
plt.xlabel('Company')
plt.ylabel('Score')
plt.title('Stacked Bar Chart of ESG Scores by Company')
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
data = yf.download(list(relevant_esg['ticker']), start='2023-01-01', end='2024-01-01')['Adj Close']
        

plt.figure(figsize=(14, 8))
for ticker in data.columns:
    plt.plot(data.index, data[ticker], label=ticker)

plt.xlabel('Date')
plt.ylabel('Adjusted Closing Price (USD)')
plt.title('Stock Price Changes Over Time for Selected Companies')
plt.legend(loc='upper left', fontsize='small')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
X = relevant_esg[['environment_score', 'social_score','governance_score']]   # Independent variables
y = relevant_esg['Rate of Change (%)']   # Dependent variable

model = LinearRegression().fit(X, y)
print(f"Environmental Score Coefficient: {model.coef_[0]}")
print(f"Social Score Coefficient: {model.coef_[1]}")
print(f"Governance Score Coefficient: {model.coef_[2]}")
print(f"Intercept: {model.intercept_}")


### Regression Coefficients Interpretation

For every 1-unit increase in the Environmental score (holding the other two scores constant), the fitted model predicts a 0.1079 percentage-point increase in Rate of Change (%). The positive coefficient means that, in our sample, higher Environmental scores are associated with better stock performance (higher returns).

For every 1-unit increase in the Social score (holding the others constant), the predicted Rate of Change (%) decreases by 0.0237 percentage points. The negative coefficient indicates that better Social scores might be associated with lower stock performance, but since the coefficient is so close to 0, Social scores seem to have little association with stock performance.

For every 1-unit increase in the Governance score (holding the others constant), the predicted Rate of Change (%) decreases by 0.0750 percentage points. This negative coefficient suggests that higher governance scores (e.g., stricter regulation or more ethical practices) are associated with slightly lower stock returns. One possible explanation is that governance improvements come at a financial cost, but with a small hand-picked sample this is an association, not evidence of a causal effect.
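
Before reading much into coefficients from a small sample, it helps to check how much of the variation the model explains. The sketch below is an illustration on random noise, not our data: it fits ordinary least squares with NumPy (the same kind of model `LinearRegression` fits) to 23 made-up observations where the response is unrelated to the predictors, and still gets nonzero slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 23  # a small sample, similar in size to our hand-picked one

# made-up predictors and a response that is pure noise (no real relationship)
X = rng.normal(size=(n, 3))
y = rng.normal(size=n)

# ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2: share of the variance in y explained by the fit
residuals = y - A @ beta
r_squared = 1 - residuals.var() / y.var()

print(beta[1:])    # slopes are nonzero even though y is unrelated to X
print(r_squared)
```

In the notebook, `model.score(X, y)` reports the same R^2 quantity for the fitted `LinearRegression`, so printing it next to the coefficients would show how much weight the interpretations above can bear.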


In [None]:
esg_corr = relevant_esg.loc[:, ['environment_score', 'social_score',
                                'governance_score']] 
print(esg_corr.corr(numeric_only=True))

In [None]:
print(esg_corr.cov(numeric_only=True))

### Correlation and Covariance 

Let's observe the correlation/covariance between the different ESG scores themselves! All the correlations are positive, meaning that there is positive correlation between the scores (when one increases, the other does too). Governance and Environment have a particularly strong correlation, suggesting that companies who invest in environmental factors likely also care about governance (or perhaps some government regulations align with environmental issues). On the other hand, social factors seem to have just a moderately positive relationship with both the other variables.

While the covariance values agree with these claims, it's interesting to note how large the variance is for Environmental Scores (20758.63). This suggests that there is a large spread in environmental performance among the companies we chose, while governance scores are much more consistent.
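
Part of the reason the covariance numbers look so different from the correlations is scale: covariance grows with the units of the variables, while correlation divides that scale out. A minimal sketch on made-up scores (the values in `toy` are invented for illustration):

```python
import pandas as pd

# made-up scores, for illustration only
toy = pd.DataFrame({
    "environment_score": [200.0, 350.0, 500.0, 650.0],
    "governance_score": [210.0, 240.0, 260.0, 300.0],
})

cov = toy.cov()
corr = toy.corr()

# correlation is covariance divided by the product of the two standard deviations
sd = toy.std()
manual_corr = cov.loc["environment_score", "governance_score"] / (
    sd["environment_score"] * sd["governance_score"]
)

# rescaling a column changes the covariance but not the correlation
scaled = toy.assign(environment_score=toy["environment_score"] / 10)

print(manual_corr, corr.loc["environment_score", "governance_score"])
print(scaled.cov().loc["environment_score", "governance_score"])
```

So a large variance for Environmental scores reflects both the spread among companies and the numeric range that score is measured on.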

---

### Data Description

This dataset contains ESG (Environmental, Social, and Governance) scores for 722 publicly traded companies, and represents a variety of different industries. Each row represents an individual company. The 21 columns include a CIK identifier, last processing date of the ESG data, company information (currency, logo/website URLs), environmental scores and ratings, social scores and ratings, governance scores and ratings, and overall ESG scores and ratings. We can combine this dataset with the Yahoo! Finance library in Python (yfinance, maintained by Ran Aroussi) to investigate if a company’s ESG score impacts its stock performance over time (time-series analysis): pypi.org/project/yfinance/. We do not intend to scrape data ourselves, but plan to utilize the yfinance library, which downloads market data from the Yahoo! Finance API, and appears to be free to use. We also will most likely filter this dataset from all 722 companies to just a few major companies of interest from select industries. Overall, our research question may be along the lines of “How do ESG ratings influence stock performance?” or “Do companies with high environmental ratings show less stock price volatility?”


We are using data from a Kaggle csv, and joining this to the yfinance library. First, we downloaded the ESG Kaggle csv, where each row corresponds to a different publicly traded company; this also contains ESG metrics for each company. We have filtered this dataset to only include companies that are traded in USD, and the columns all refer to a specific aspect of the company, such as industry, name, stock close/open/high/low prices, and ESG rating indexes. Then, we join this data to the yfinance library (Yahoo Finance); from yfinance, we take the stock closing price of the company on 4/1/21 and 4/1/22. We also created a % Change variable, which measures the percentage change in the company stock closing price between these two dates. All of this information is included in our large dataset, so the columns are company stock ticker, company name, currency, exchange, industry, logo, web url, environment grade, environment level, social grade, social level, governance grade, governance level, environmental score, social score, governance score, total score, last processing date, total grade, total level, central index key, stock closing price on 4/1/21, stock closing price on 4/1/22, and the percent change between these. Our smaller dataset is meant to be a 'sample' of the original dataset with 722 rows, and involves a few companies that we hand-picked by notability. It contains all of the same columns, but only contains 23 companies.

Our dataframes were created to show different stock prices of various companies over different date ranges, and the original ESG company data was created to compile ESG information for roughly 700 mid/large-cap companies across various industries. Our combined dataframe was made to contrast both stock prices and ESG ratings of companies, and explore any associations. The original ESG company dataset was 'funded' by the efforts of Kaggle user Alistair King, and the yfinance library was created by Ran Aroussi as a way around the recent-ish Yahoo Finance API deprecation. For the ESG rating dataset, only mid/large-cap companies are included, so this influences the specific companies that are present in the dataset (the data that was observed and recorded); smaller companies will not be 'observed' here.

The preprocessing was described above; for our large dataset, we removed rows with NaNs from our 722-row dataset, and for our sample dataset, we filtered all 722 companies down to 23 of interest. We joined all data to the yfinance library. Specifically, for each company, we acquired the stock closing price for 4/1/21 and 4/1/22, then created a stock change percentage variable between these two dates.

Individuals are not involved in the data directly, as each observation corresponds to an entire company.

Our raw source data can be found in the yfinance library and https://www.kaggle.com/datasets/alistairking/public-company-esg-ratings-dataset/data, and the specific csv is here: https://github.com/phoebewang28/info-2950-project/blob/main/esg_data.csv. 




---

### Data Limitations
    
1. For our smaller dataset, the current sample of 23 companies was chosen manually by us as we wanted to get a range of industries that are well-known. This is limited to companies that only use USD, and again, this data only records large/mid-cap companies, so this selection may not be fully representative of all US companies with ESG ratings. Depending on the results of our analyses, we may consider random sampling or expanding the sample size to improve the representativeness of our sample dataset.

2. We are currently comparing the rate of change of the sample stocks to the S&P 500, but other measures of stock performance might provide more valuable insights. For now, we are focusing on the rate of change between the closing prices of 4/1/21 and 4/1/22. 

3. Some stock data from yfinance is missing date information, which causes missing values when extracting prices. One company in our sample had this issue, so we had to exclude it to ensure consistency.

4. Since we are exploring potential connections between ESG ratings and company stock performance, we may need to sample not only by industry but also by ESG rating levels to ensure a more balanced and comprehensive analysis of the different ESG performance tiers (for our sample dataset).

5. Each company's ESG rating is a constant value retrieved on a single processing date (these dates range from 2/8/22 to 11/15/22 across companies), while stock prices for these companies change over time. Because of this, we are unable to perform time-series analysis on ESG ratings and stock information.

6. ESG is evaluated annually, so a rating might not accurately track the company's actual environmental performance at a given time. Thus, when considering the short-term impact of a company's ESG policies, a policy change is likely to affect the stock price before it is reflected in the company's ESG rating.
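
As a possible alternative to hand-picking (limitations 1 and 4), here is a minimal sketch of stratified random sampling with pandas. The `toy` frame is a made-up stand-in for merged_df, and drawing 2 companies per industry is an arbitrary choice for illustration.

```python
import pandas as pd

# toy stand-in for merged_df (made-up companies)
toy = pd.DataFrame({
    "name": [f"company_{i}" for i in range(12)],
    "industry": ["Airlines"] * 4 + ["Utilities"] * 4 + ["Media"] * 4,
})

# draw the same number of companies from every industry;
# random_state makes the draw reproducible
sample = toy.groupby("industry").sample(n=2, random_state=0)

print(sample["industry"].value_counts())
```

Grouping by an ESG grade column such as total_grade instead of industry would stratify by ESG tier in the same way.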

---

### Questions for Reviewers

1. After combining our yfinance and ESG data, do we have large enough datasets to satisfy the complexity requirement for the research question?
   
2. Is there any advice you would recommend we follow when sampling from the population? Should we sample by industry, and if so, from how many industries? Should we stratify to include every ESG grade? (as a reminder - we have one big dataset with 700+ companies and joined data that we did some EDA with, but we also want to include a smaller sample dataset so we can look at some individual companies as well).
   
3. How many visualizations and statistics are recommended for the final project? (a ballpark range would be helpful) Do the visualizations we currently have seem like they're on the right path for the final phases?
   
4. Regarding the visualizations and chunks we have made for our EDA so far: should we explore these specific visualizations more in depth? OR should we expand our EDA to other variables in the datasets that we maybe haven't used yet?