# COGS 108 - Final Project (change this to your project's title)

# Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [  ] NO - keep private

# Names

- Zoya Hasan
- Arushi Munjal 
- Shruti Yamala
- Siya Randhawa

# Abstract

Please write one to four paragraphs that describe a very brief overview of why you did this, how you did, and the major findings and conclusions.

# Research Question

Using sentiment analysis, which lyrical themes—such as love, empowerment, struggle, anger, hope, celebration, and nostalgia—are most commonly identified in the top songs of Spotify's English-language genres (Pop, Rap, Rock, R&B) from 2000 to 2023?


## Background and Prior Work

Streaming platforms like Spotify have transformed the music industry, impacting not only how music is consumed but also how data on listener preferences, song popularity, and musical trends are accessed. Understanding the factors that drive a song’s popularity can provide insights for artists and producers looking to create resonant music. While many elements influence popularity—including artist reputation and song structure—lyrics play a critical role by directly conveying emotions and themes that listeners connect with. Our project seeks to explore how specific lyrical themes correlate with song popularity across various genres, highlighting trends that resonate most with audiences on platforms like Spotify.

Previous studies have explored the role of sentiment analysis in predicting song popularity and classifying genre. For instance, a study published in Ultimatics: Jurnal Teknik Informatika developed a BERT-based model to predict song popularity from sentiment analysis of English song lyrics, reaching an accuracy of 87% with the help of oversampling and data preprocessing. The study found that the sentiment expressed in lyrics, such as positivity or negativity, was a significant factor in popularity, linking lyrical emotion to audience engagement and song performance [1].

Boonyanit and Dahl at Stanford took a different approach, classifying songs into genres based solely on lyrical content. Using GloVe embeddings and LSTM models, they reached a peak accuracy of 68%. Their work showed that lyrics carry genre-related signals, especially through distinctive words and recurring themes, and that word choice and thematic style differ across genres like hip-hop, pop, and rock despite some overlap [2].

Building on these studies, our project does not predict popularity or genre independently. Instead, it examines how specific themes within lyrics relate to popularity across pop, rap, rock, and R&B. We are not merely classifying songs by sentiment or genre; we are treating genre as a contextual factor in lyrical themes. This will allow us to determine which themes, such as love, nostalgia, or resilience, are associated with higher engagement in particular genres, offering insight into the preferences of genre-specific audiences. Our findings could help musicians tailor lyrics to listener tastes on streaming platforms.

1. Sentiment Analysis on Song Lyrics for Song Popularity Prediction Using BERT Algorithm, Ultimatics: Jurnal Teknik Informatika, 2023.
2. Music Genre Classification using Song Lyrics, Stanford CS224N Custom Project, 2023.

# Hypothesis


We hypothesize that lyrical themes identified through sentiment analysis will correlate distinctly with song popularity across Spotify’s English-language genres from 2000 to 2023. We expect themes like "love" and "nostalgia" to be more prevalent in Pop due to its universal appeal and focus on emotional connections. In contrast, "empowerment," "struggle," and "anger" will likely dominate in Rap, reflecting its emphasis on resilience, self-expression, and cultural commentary. For Rock, we anticipate themes of "anger," "struggle," and "celebration," aligning with its high-energy, rebellious tone. Similarly, "love," "hope," and "struggle" are predicted to appear most often in R&B, capturing its relational and soulful storytelling.

This prediction is based on our understanding of how different genres cater to specific emotional and cultural experiences. For example, Pop often explores personal connections, while Rap delves into themes of resilience and social commentary. Our thinking also draws from personal experiences and observations of recurring themes in chart-topping songs over the past decade.

# Data

## Data Overview #1


- Dataset #1
  - Dataset Name: Audio features and lyrics of Spotify songs
  - Link to the dataset: https://www.kaggle.com/datasets/imuhammad/audio-features-and-lyrics-of-spotify-songs
  - Number of observations: 18,454
  - Number of variables: 25

This dataset comprises 18,454 Spotify songs and provides metadata, audio features, and lyrics for each track. The key variables are track_popularity (a numerical score from 0 to 100 indicating song popularity), lyrics (the text of the song lyrics), and playlist_genre (a categorical variable giving the song's primary genre).

We downloaded the CSV file locally and loaded it into Python as a pandas DataFrame. Data cleaning and wrangling includes keeping only English-language songs, converting release dates to years, handling missing values, dropping columns that aren't relevant to the research question (e.g., danceability), and keeping only songs released between 2000 and 2023 in the pop, rap, rock, and R&B genres.

Text preprocessing on the lyrics will include tokenizing, removing stop words, applying a keyword-based approach, and investigating sentiment analysis to identify lyrical themes.

## Dataset 1: Audio features and lyrics of Spotify songs

In [1]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 
import pandas as pd
spotify_songs = pd.read_csv('/Users/shruti14/Downloads/spotify_songs.csv')
spotify_songs.head()
spotify_songs.shape
spotify_songs = spotify_songs[spotify_songs['language'] == 'en']
spotify_songs.head()
spotify_songs = spotify_songs.drop(columns=['mode', 'key', 'speechiness', 'loudness', 
                                            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                                           'danceability', 'energy', 'track_id','track_artist', 'track_album_id', 'track_album_name', 'playlist_name', 'playlist_id', 'playlist_subgenre', 'duration_ms', 'language'])
spotify_songs['track_album_release_date'] = pd.to_datetime(spotify_songs['track_album_release_date']).dt.year
spotify_songs = spotify_songs.rename(columns={'track_album_release_date': 'year_released', 'track_name': 'song'})
spotify_songs = spotify_songs[(spotify_songs['year_released'] >= 2000) & (spotify_songs['year_released'] <= 2023)]
spotify_songs = spotify_songs.reset_index(drop=True)
spotify_songs.head()
# Keep only the four genres named in the research question
spotify_songs = spotify_songs[spotify_songs['playlist_genre'].isin(['rock', 'pop', 'r&b', 'rap'])]
spotify_songs.head()
spotify_songs['playlist_genre'].value_counts()
spotify_songs.head()

| | song | lyrics | track_popularity | year_released | playlist_genre |
|---|---|---|---|---|---|
| 0 | I Feel Alive | The trees, are singing in the wind The sky blu... | 28 | 2017 | rock |
| 1 | Poison | NA Yeah, Spyderman and Freeze in full effect U... | 0 | 2005 | r&b |
| 2 | Baby It's Cold Outside (feat. Christina Aguilera) | I really can't stay Baby it's cold outside I'v... | 41 | 2012 | r&b |
| 3 | Dumb Litty | Get up out of my business You don't keep me fr... | 65 | 2019 | pop |
| 4 | Soldier | Hold your breath, don't look down, keep trying... | 70 | 2019 | r&b |


## Data Overview #2


- Dataset #2
  - Dataset Name: Spotify Analysis and Visualization 
  - Link to the dataset: https://www.kaggle.com/code/abdallahwagih/spotify-analysis-and-visualization
  - Number of observations: 1879
  - Number of variables: 18

This dataset contains 1,879 songs categorized by genre and labeled with a popularity score ranging from 0 to 100. Key variables include song (song title), artist (artist name), genre (musical genre), and popularity (the target metric). Preprocessing steps involve handling missing or inconsistent genre labels, encoding categorical variables, and scaling the popularity scores. Additional steps may include feature engineering, such as extracting linguistic patterns from song titles, and balancing the dataset if popularity scores are skewed. We clean this dataset after merging it with Dataset #3; see the description under Data Overview #3 for the cleaning and wrangling of the merged data.

## Dataset 2:  Spotify Analysis and Visualization  

In [3]:
url = 'https://storage.googleapis.com/kagglesdsdata/datasets/2125460/3723559/songs_normalize.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20241208%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20241208T195828Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=6c8a92ba6bdd2af854970dda7b8f96aae3f38c4ee90319f6e7628ef3c7d115b9dc9ab8d68932ce17ee28d80c01d82748024db926aa02617fb23fb095c14627b745a2d6f164c8cc6479511317445f7bc93a778a51ec8b4fc3c071571d6a9c7675873a6fcbdbcc8488a963d94c78c55c6b62fcf65e03ed501bdd40897755ab89258e2c3e2ad363cc2a126a5452637e1f2d9c79d27eb3bf2af60200f47ac6fc6cd5c83133443032623d541a2a69dcf0bb1fd317429ff616c4da25a5b8e049e438109384fd5167547b954df1e5db394f61ef5e4ddec179d953ab0793b021f8d3fa604e33ee0089fea371273f1fb101ca69689b72e950edeb77680a3dd3e59ebf8d22'
songs_popularity_genres = pd.read_csv(url)
songs_popularity_genres.head()

| | artist | song | duration_ms | explicit | year | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Britney Spears | Oops!...I Did It Again | 211160 | False | 2000 | 77 | 0.751 | 0.834 | 1 | -5.444 | 0 | 0.0437 | 0.3 | 1.8e-05 | 0.355 | 0.894 | 95.053 | pop |
| 1 | blink-182 | All The Small Things | 167066 | False | 1999 | 79 | 0.434 | 0.897 | 0 | -4.918 | 1 | 0.0488 | 0.0103 | 0.0 | 0.612 | 0.684 | 148.726 | rock, pop |
| 2 | Faith Hill | Breathe | 250546 | False | 1999 | 66 | 0.529 | 0.496 | 7 | -9.007 | 1 | 0.029 | 0.173 | 0.0 | 0.251 | 0.278 | 136.859 | pop, country |
| 3 | Bon Jovi | It's My Life | 224493 | False | 2000 | 78 | 0.551 | 0.913 | 0 | -4.063 | 0 | 0.0466 | 0.0263 | 1.3e-05 | 0.347 | 0.544 | 119.992 | rock, metal |
| 4 | \*NSYNC | Bye Bye Bye | 200560 | False | 2000 | 65 | 0.614 | 0.928 | 8 | -4.806 | 0 | 0.0516 | 0.0408 | 0.00104 | 0.0845 | 0.879 | 172.656 | pop |


## Data Overview #3


- Dataset #3
  - Dataset Name: 150K Lyrics Labeled with Spotify Valence
  - Link to the dataset: https://www.kaggle.com/datasets/edenbd/150k-lyrics-labeled-with-spotify-valence/data  
  - Number of observations: 150,000
  - Number of variables: 5


This dataset contains 150,000 song lyrics labeled with Spotify Valence scores, which range from 0 to 1 and indicate the emotional positivity of a song (1 being highly positive). Important variables include artist (artist name), seq (song lyrics), song (song title), and label (valence score). The seq column requires preprocessing, such as tokenization, stopword removal, and text normalization, while label serves as the target for mood prediction. Additional steps include handling missing values, balancing the valence distribution, and potentially engineering features like word sentiment or linguistic complexity.

After merging Spotify Analysis and Visualization with 150K Lyrics Labeled with Spotify Valence, we renamed the 'popularity', 'year', 'genre', and 'seq' columns to match the columns of the first dataset, Audio features and lyrics of Spotify songs. We then filtered the years to be between 2000 and 2023 and kept only the pop, rock, r&b, and rap genres. Finally, we combined that dataframe (Spotify Analysis and Visualization plus 150K Lyrics Labeled with Spotify Valence) with the first dataset to obtain one large, organized dataset that focuses on specific years and genres.

## Dataset 3: 150K Lyrics Labeled with Spotify Valence

In [4]:
songs_lyrics = pd.read_csv('/Users/shruti14/Downloads/labeled_lyrics_cleaned.csv')
songs_lyrics.head()


## Final Merged Cleaned Dataset

In [5]:
merged_df = pd.merge(songs_popularity_genres, songs_lyrics, on='song', how='inner')
cleaned_df = merged_df[['song', 'popularity', 'year', 'genre', 'seq']].rename(columns={'seq': 'lyrics', 'popularity': 'track_popularity', 'year': 'year_released', 'genre':'playlist_genre'})
filtered_df = cleaned_df[(cleaned_df['year_released'] >= 2000) & (cleaned_df['year_released'] <= 2023)]
# .copy() gives an independent DataFrame, so the genre assignment below does not write to a slice
unique_songs_df = filtered_df.drop_duplicates(subset='song', keep='first').copy()
#unique_songs_df['playlist_genre'].value_counts()
unique_genres = ['pop', 'rap', 'rock', 'r&b']

# Function to clean and simplify genre
def clean_genre(genre):
    # Take the first genre from the list (split by commas)
    first_genre = genre.split(',')[0].strip().lower()
    # Match it to the unique genres; if not found, return 'other'
    return first_genre if first_genre in unique_genres else 'other'

# Apply the function to the genre column
unique_songs_df['playlist_genre'] = unique_songs_df['playlist_genre'].apply(clean_genre)
# Keep only the four genres of interest, filtering on this DataFrame's own column (this also drops 'other')
unique_songs_df = unique_songs_df[unique_songs_df['playlist_genre'].isin(unique_genres)]
unique_songs_df['playlist_genre'].value_counts()

merged_df = pd.concat([spotify_songs, unique_songs_df], ignore_index=True)
merged_df


| | song | lyrics | track_popularity | year_released | playlist_genre |
|---|---|---|---|---|---|
| 0 | I Feel Alive | The trees, are singing in the wind The sky blu... | 28 | 2017 | rock |
| 1 | Poison | NA Yeah, Spyderman and Freeze in full effect U... | 0 | 2005 | r&b |
| 2 | Baby It's Cold Outside (feat. Christina Aguilera) | I really can't stay Baby it's cold outside I'v... | 41 | 2012 | r&b |
| 3 | Dumb Litty | Get up out of my business You don't keep me fr... | 65 | 2019 | pop |
| 4 | Soldier | Hold your breath, don't look down, keep trying... | 70 | 2019 | r&b |
| ... | ... | ... | ... | ... | ... |
| 10068 | Wish You Well | He wrote a name\r\nWith the needle gun\r\nIn b... | 64 | 2019 | pop |
| 10069 | High Hopes | Run a mile run a mile\r\n'cause all the while\... | 80 | 2018 | rock |
| 10070 | How Do You Sleep? | I know, I know, I know you want to see me fall... | 73 | 2019 | pop |
| 10071 | Sucker | New town and a new home to save your skin\r\nT... | 79 | 2019 | pop |


In [13]:
merged_df[merged_df['playlist_genre'] == 'rock']['track_popularity'].max()

89

# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

## First Analysis You Did - Give it a better title

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [43]:
from nltk.stem import WordNetLemmatizer

In [49]:
import re
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (newer NLTK releases may also need 'punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Custom stopword list, built once: English stopwords, lyric filler words, and punctuation
custom_stopwords = set(stopwords.words('english')).union({
    'oh', 'yeah', 'na', 'll', 'baby', 'let', 'ca', 'wan', 've', 'ai', 'way', 'come', 'ooh', 'gon', 'say', 'like', 'know', 'got', 'cause', 'im'
}).union(string.punctuation)

def preprocess_lyrics(lyrics):
    # Lowercase and strip contraction suffixes
    lyrics = lyrics.lower()
    lyrics = re.sub(r"'ll", '', lyrics)  # Remove 'll
    lyrics = re.sub(r"'ve", '', lyrics)  # Remove 've
    lyrics = re.sub(r"'re", '', lyrics)  # Remove 're
    lyrics = re.sub(r"'m", '', lyrics)   # Remove 'm
    lyrics = re.sub(r"'d", '', lyrics)   # Remove 'd
    lyrics = re.sub(r"'s", '', lyrics)   # Remove 's
    lyrics = re.sub(r"n't", ' not', lyrics)  # Replace n't with 'not'
    lyrics = re.sub(r"'", '', lyrics)  # Remove any remaining apostrophes

    # Tokenize, drop stopwords and punctuation, and lemmatize what remains
    words = word_tokenize(lyrics)
    words = [lemmatizer.lemmatize(word) for word in words if word not in custom_stopwords]
    return ' '.join(words)


merged_df['lyrics_cleaned'] = merged_df['lyrics'].apply(preprocess_lyrics)


# Display the first few rows
print(merged_df[['song', 'lyrics_cleaned']].head())


                                                song  \
0                                       I Feel Alive   
1                                             Poison   
2  Baby It's Cold Outside (feat. Christina Aguilera)   
3                                         Dumb Litty   
4                                            Soldier   

                                      lyrics_cleaned  
0  tree singing wind sky blue angel smiled saw lo...  
1  spyderman freeze full effect uh-huh ready ron ...  
2  really stay cold outside go away cold evening ...  
3  get business keep turning 모두 다 여긴 witness 넌 바른...  
4  hold breath look keep trying darling okay scar...  


In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorization with stopwords removal
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=5,           # Appear in at least 5 documents
    max_df=0.8          # Appear in at most 80% of documents
)
tfidf_matrix = vectorizer.fit_transform(merged_df['lyrics_cleaned'])

# Create a DataFrame for TF-IDF scores
tfidf_feature_names = vectorizer.get_feature_names_out()
tfidf_scores = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_feature_names)

# Display top words with highest average TF-IDF score
word_scores = tfidf_scores.mean().sort_values(ascending=False)
word_scores.head(50) # Show the top 50 terms

love         0.037929
want         0.023914
time         0.022775
nigga        0.022566
feel         0.021147
make         0.018891
need         0.018729
girl         0.017615
right        0.016338
tell         0.015767
life         0.015518
away         0.015212
night        0.014774
day          0.014555
bitch        0.014393
heart        0.014170
oh           0.013987
think        0.013842
thing        0.013640
wo           0.013503
good         0.012901
shit         0.012806
fuck         0.012711
man          0.012571
mind         0.012532
hey          0.012298
said         0.012031
love love    0.012019
eye          0.011380
better       0.011163
look         0.011038
hold         0.010846
ta           0.010758
home         0.010567
world        0.010411
light        0.010252
little       0.010246
stay         0.009874
ya           0.009865
hand         0.009797
tonight      0.009760
boy          0.009501
long         0.009362
feeling      0.009349
leave        0.009346
em        

In [29]:
themes = {
    'love': ['love', 'heart', 'girl'],
    'empowerment': ['strong', 'fight', 'life'],
    'struggle': ['away', 'hold', 'leave'],
    'celebration': ['night', 'tonight', 'good'],
    'nostalgia': ['time', 'home', 'world']
}
#Make dictionaries longer and better, and make some method that only allows a song to be associated with the one theme that 
#it has the most words for

In [30]:
def check_theme(lyrics, theme_words):
    return any(word in lyrics.split() for word in theme_words)

# Add a column for each theme
for theme, words in themes.items():
    merged_df[theme] = merged_df['lyrics_cleaned'].apply(lambda x: check_theme(x, words))


In [32]:
# Filter popular songs
popular_songs = merged_df[merged_df['track_popularity'] >= 70]

# Group by genre and calculate theme occurrences
theme_summary = popular_songs.groupby('playlist_genre')[list(themes.keys())].sum()

# Display the theme summary
print(theme_summary)

                love  empowerment  struggle  celebration  nostalgia
playlist_genre                                                     
pop              590          262       373          400        499
r&b              189           83       101          121        164
rap              196           87       114           92        165
rock              73           44        63           50         91
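
The raw counts above are hard to compare across genres, because the pop subset contains far more popular songs than the rock subset. One possible follow-up, sketched here under the assumption that the printed table is copied by hand into a new frame (the names `theme_counts`, `theme_shares`, and `top_theme` are hypothetical and not part of the notebook), is to convert each row into within-genre shares. Since one song can match several themes, these are shares of theme matches, not shares of songs.

```python
import pandas as pd

# Counts copied from the theme_summary output above (track_popularity >= 70).
theme_counts = pd.DataFrame(
    {
        "love": [590, 189, 196, 73],
        "empowerment": [262, 83, 87, 44],
        "struggle": [373, 101, 114, 63],
        "celebration": [400, 121, 92, 50],
        "nostalgia": [499, 164, 165, 91],
    },
    index=pd.Index(["pop", "r&b", "rap", "rock"], name="playlist_genre"),
)

# Divide each row by its total so that every genre sums to 1.
theme_shares = theme_counts.div(theme_counts.sum(axis=1), axis=0)

# Theme with the largest share of matches in each genre.
top_theme = theme_shares.idxmax(axis=1)

print(theme_shares.round(3))
print(top_theme)
```

With the current short keyword lists, love has the largest share for pop, r&b, and rap, while nostalgia leads for rock; these rankings may shift once the theme dictionaries are expanded.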


In [58]:
popular_songs = merged_df[merged_df['track_popularity'] >= 50]
popular_songs.shape
# highest average popularity
# Threshold will be >= average popularity metric across all songs (should have something to do with EDA)

(5074, 11)
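
The comment in the cell above suggests replacing the fixed cutoff with the average popularity across all songs. A minimal sketch of that idea follows, using a small made-up frame (the rows in `toy_df` are illustrative only and are not taken from the datasets); in the notebook, `toy_df` would presumably be `merged_df`.

```python
import pandas as pd

# Illustrative rows only; the real notebook would use merged_df.
toy_df = pd.DataFrame(
    {
        "song": ["a", "b", "c", "d"],
        "track_popularity": [20, 40, 60, 80],
    }
)

# Use the mean popularity as a data-driven cutoff instead of a fixed number.
popularity_threshold = toy_df["track_popularity"].mean()
popular_songs = toy_df[toy_df["track_popularity"] >= popularity_threshold]

print(popularity_threshold, len(popular_songs))
```

If the popularity distribution turns out to be strongly skewed in the EDA, the median may be a more robust cutoff than the mean.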

In [59]:
# WHAT WE'RE DOING

# 1. Got words with high TF-IDF scores after preprocessing.
# 2. Figure out which words should correspond to each theme.
# 3. Make the theme dictionaries longer and better.
# 4. Check the lyrics of every song against the themes, assigning each song at most one theme:
#    the one whose keywords it matches most (e.g. if the lyrics match the 'love' words the most, ONLY 'love' is assigned).
# 5. Set a threshold for popular songs (idea: >= average popularity across all songs; maybe incorporate this into our EDA as well).
# 6. Group by genre and calculate theme occurrences.
# 7. Create some sort of visualization to capture our findings.
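
The plan above calls for assigning each song only the theme whose keywords it matches most. One way this could work is sketched below; `dominant_theme` is a hypothetical helper that is not part of the notebook, and the keyword lists are copied from the `themes` cell. It counts keyword occurrences per theme and returns the theme with the most hits, or `None` when no keyword appears.

```python
from collections import Counter

# Keyword lists copied from the themes cell above.
themes = {
    "love": ["love", "heart", "girl"],
    "empowerment": ["strong", "fight", "life"],
    "struggle": ["away", "hold", "leave"],
    "celebration": ["night", "tonight", "good"],
    "nostalgia": ["time", "home", "world"],
}


def dominant_theme(cleaned_lyrics, themes):
    """Return the theme with the most keyword hits, or None if nothing matches."""
    token_counts = Counter(cleaned_lyrics.split())
    hits = {
        theme: sum(token_counts[word] for word in words)
        for theme, words in themes.items()
    }
    # max() returns the first maximal key, so ties go to the theme listed first.
    best_theme = max(hits, key=hits.get)
    return best_theme if hits[best_theme] > 0 else None


print(dominant_theme("love love heart night", themes))
print(dominant_theme("tree wind sky", themes))
```

In the notebook this could be applied as `merged_df['theme'] = merged_df['lyrics_cleaned'].apply(lambda x: dominant_theme(x, themes))`. Breaking ties by dictionary order is an arbitrary choice; returning all tied themes or dropping tied songs are alternatives.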

## Second Analysis You Did - Give it a better title


In [60]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

#Improvements here?

## ETC AD NAUSEAM


In [61]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

#Heatmap/Visualization
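
As a starting point for the heatmap noted above, the sketch below plots the genre-by-theme counts from the earlier `theme_summary` output (popularity >= 70) with matplotlib. The counts are copied by hand, each row is scaled to sum to 1 so that genres of different sizes are comparable, and the colormap, labels, and figure size are arbitrary choices rather than anything fixed by the project.

```python
import matplotlib.pyplot as plt
import numpy as np

genres = ["pop", "r&b", "rap", "rock"]
theme_names = ["love", "empowerment", "struggle", "celebration", "nostalgia"]

# Counts copied from the theme_summary output (rows follow `genres`).
counts = np.array([
    [590, 262, 373, 400, 499],
    [189, 83, 101, 121, 164],
    [196, 87, 114, 92, 165],
    [73, 44, 63, 50, 91],
])

# Scale each row to sum to 1: share of theme matches within a genre.
shares = counts / counts.sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(7, 4))
image = ax.imshow(shares, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(theme_names)))
ax.set_xticklabels(theme_names, rotation=45, ha="right")
ax.set_yticks(range(len(genres)))
ax.set_yticklabels(genres)

# Write the share inside each cell so the figure can be read without the colorbar.
for row in range(shares.shape[0]):
    for col in range(shares.shape[1]):
        ax.text(col, row, f"{shares[row, col]:.2f}", ha="center", va="center", color="white")

fig.colorbar(image, ax=ax, label="share of theme matches within genre")
ax.set_title("Theme shares by genre (track_popularity >= 70)")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.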

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Discussion and Conclusion

Wrap it all up here.  Somewhere between 3 and 10 paragraphs roughly.  A good time to refer back to your Background section and review how this work extended the previous stuff. 


# Team Contributions

Specify who did what.  This should be pretty granular, perhaps bullet points, no more than a few sentences per person.