## 7. Group Assignment & Presentation



__You should be able to start up on this exercise after Lecture 1.__

*This exercise must be a group effort. That means everyone must participate in the assignment.*

In this assignment you will solve a data science problem end-to-end, pretending to be recently hired data scientists in a company. To help you get started, we've prepared a checklist to guide you through the project. Here are the main steps that you will go through:

1. Frame the problem and look at the big picture
2. Get the data
3. Explore and visualise the data to gain insights
4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
5. Explore many different models and short-list the best ones
6. Fine-tune your models
7. Present your solution 

In each step we list a set of questions that one should have in mind when undertaking a data science project. The list is not meant to be exhaustive, but does contain a selection of the most important questions to ask. We will be available to provide assistance with each of the steps, and will allocate some part of each lesson towards working on the projects.

Your group must submit a _**single**_ Jupyter notebook, structured in terms of the first 6 sections listed above (the seventh will be a video uploaded to some streaming platform, e.g. YouTube, Vimeo, etc.).

### 1. Analysis: Frame the problem and look at the big picture
1. Find a problem/task that everyone in the group finds interesting
2. Define the objective in business terms
3. How should you frame the problem (supervised/unsupervised etc.)?
4. How should performance be measured?

##### 1. Identifying an Interesting Problem/Task
As a team of newly hired data scientists at MetaData Inc. (Meta), we have identified a crucial and challenging task: analyzing the sentiment of user posts. Our goal is to understand the general mood and opinions that users express on the platform. This matters to MetaData because it can provide insight into user engagement, reveal emerging trends, and identify potential areas of concern, such as the spread of negative sentiment or harmful content (which, of course, we personally don't like).

##### 2. Defining the Objective in Business Terms
The primary business objective of our sentiment analysis project is to develop a robust model that can classify user posts into various sentiment categories such as positive, negative, or neutral. This model will help MetaData in several ways:

- Enhancing User Experience: By understanding the sentiment of posts, MetaData can tailor the user experience, for example by suggesting more content that aligns with the user’s interests or mood.
- Content Moderation: Sentiment analysis can flag potentially harmful or negative content for further review, thereby maintaining a healthy online community.
- Marketing Insights: Analyzing sentiment can provide valuable insights for advertisers and partners about public opinion on various topics, products, or services.

The model can then be applied to posts, comments, and any other interactions with posts.

##### 3. Framing the Problem
For MetaData, sentiment analysis of user posts is a classic case of supervised learning, as it involves categorizing text data into predefined sentiment categories. We will train our model on a dataset where each post is labeled with a sentiment, allowing the model to learn and predict the sentiment of new, unlabeled posts. This approach aligns with the supervised learning paradigm, where the input (user post) and output (sentiment label) are clearly defined. We expect this to be a multi-class classification problem, since we want to categorize beyond just positive/negative.

##### 4. Measuring Performance
To evaluate the performance of our sentiment analysis model, we will focus on the following metrics:

- Accuracy: This will provide a basic understanding of how often our model correctly classifies a post’s sentiment.
- Precision and Recall: These metrics are crucial for understanding the model’s performance in classifying specific sentiments, especially in a dataset that may have uneven representation of sentiments.
- F1 Score: As a harmonic mean of precision and recall, the F1 score will help us balance the trade-off between these two metrics, which is vital in ensuring that our model is not biased toward a particular sentiment.
- Confusion Matrix: This will help us visualize the performance of the model in terms of false positives and false negatives, giving us insights into specific areas where the model might need improvement.

By focusing on these metrics, we can refine our model to ensure it meets the business needs of MetaData while maintaining high standards of accuracy and reliability.
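
To make the metrics above concrete, here is a minimal, dependency-free sketch of how accuracy, per-class precision/recall/F1, and a macro-averaged F1 can be computed from true and predicted labels. The helper names and the toy labels are our own illustrations, not part of the project code; in practice a library such as scikit-learn provides equivalent functions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    counts = confusion_counts(y_true, y_pred)
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = counts[(label, label)]
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    report = per_class_report(y_true, y_pred)
    return sum(f1 for _, _, f1 in report.values()) / len(report)


# Toy labels, made up for illustration: 0 = sadness, 1 = joy, 2 = love
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("per class:", per_class_report(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

The macro average treats every class equally, which is one way to avoid a rare sentiment being hidden behind a high overall accuracy.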

------------------------------------

### 2. Get the data
1. Find and document where you can get the data from
2. Get the data
3. Check the size and type of data (time series, geographical etc)

##### 1. Finding and Documenting Data Sources
As part of our project at MetaData Inc., we have identified two promising datasets from Hugging Face, a renowned platform for machine learning models and datasets.

- Dataset A: This dataset is relatively large, containing approximately 400,000 entries. However, it primarily includes sentences framed around personal experiences and feelings, often starting with "I" (e.g., "I'm feeling quite sad and sorry for myself but I'll snap out of it soon"). This dataset provides a deep insight into individual sentiment but may have limitations in terms of diversity and scope of expressions.
- Dataset B: The second dataset, although smaller with around 200,000 entries, offers a broader range of content. It includes various types of text, not just personal statements, which makes it more general and potentially more representative of the diverse content on social media. However, this dataset categorizes emotions into more classes than Dataset A, which could add complexity to our analysis and model training.

##### 2. Acquiring the Data
We have downloaded both datasets from Hugging Face. Given the nature of our project and the need for a comprehensive understanding of user sentiments on a social media platform, having two datasets with different characteristics is beneficial. Dataset A provides depth in personal sentiment expression, while Dataset B offers breadth and variety.

##### 3. Checking Size and Type of Data
1. Dataset A 'emotion':
- Size: Approximately 400,000 entries
- Type: The data seems to be predominantly subjective personal statements with a focus on individual emotions and experiences. This dataset may lack diversity in sentence structure and context, as it is centered around first-person expressions.
- This dataset contains no unneeded columns: there is only the text and a label to go with it.

2. Dataset B 'go_emotions':
- Size: Approximately 200,000 entries
- Type: This dataset is more varied, not just in terms of the emotional spectrum but also in the types of sentences and contexts. It includes a wider range of expressions, which might be more reflective of general user posts on social media. However, the additional emotion categories present a challenge in terms of classification complexity.
- This dataset contains more columns that are unimportant for our analysis, and it uses many more emotion categories.


For our sentiment analysis project, both datasets will be crucial. Dataset A can help us understand personal sentiments in depth, while Dataset B will allow us to explore a broader range of expressions and contexts. The next steps will involve exploring these datasets in more detail and preparing them for analysis, considering their unique characteristics and how they complement each other in addressing our project's objectives.

In [27]:
!pip3 install pandas
!pip3 install datasets


In [None]:
from datasets import load_dataset

In [55]:
emotion_dataset = load_dataset("dair-ai/emotion", "unsplit")

Generating train split: 100%|██████████| 416809/416809 [00:05<00:00, 80618.06 examples/s]


In [59]:
# loading emotion dataset
import pandas as pd

emotion_raw = emotion_dataset['train']
emotion_raw_df = pd.DataFrame(emotion_raw)

In [63]:
emotion_raw_df.head()

| | text | label |
|---|---|---|
| 0 | i feel awful about it too because it s my job ... | 0 |
| 1 | im alone i feel awful | 0 |
| 2 | ive probably mentioned this before but i reall... | 1 |
| 3 | i was feeling a little low few days back | 0 |
| 4 | i beleive that i am much more sensitive to oth... | 2 |


In [57]:
go_emotions_dataset = load_dataset("go_emotions", "raw")

In [58]:
# Loading go_emotions dataset
import pandas as pd

go_emotions_raw = go_emotions_dataset['train']
go_emotions_raw_df = pd.DataFrame(go_emotions_raw)

In [68]:
pd.set_option('display.max_columns', None)
go_emotions_raw_df.head()

Unnamed: 0,text,id,author,subreddit,link_id,parent_id,created_utc,rater_id,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,disapproval,disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,eew5j0j,Brdd9,nrl,t3_ajis4z,t1_eew18eq,1548381000.0,1,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,>sexuality shouldn’t be a grouping category I...,eemcysk,TheGreen888,unpopularopinion,t3_ai4q37,t3_ai4q37,1548084000.0,37,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"You do right, if you don't care then fuck 'em!",ed2mah1,Labalool,confessions,t3_abru74,t1_ed2m7g7,1546428000.0,37,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,eeibobj,MrsRobertshaw,facepalm,t3_ahulml,t3_ahulml,1547965000.0,18,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",eda6yn6,American_Fascist713,starwarsspeculation,t3_ackt2f,t1_eda65q2,1546669000.0,2,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [79]:
emotion_raw_df.to_csv('../exports/emotion_raw.csv')
go_emotions_raw_df.to_csv('../exports/go_emotions_raw.csv')

#### Downsampling for better exploration

In [81]:
## dataset down-sampling

fraction = 0.05

emotion_df_slim = emotion_raw_df.sample(frac=fraction)

print(f"Original size: {len(emotion_raw_df)}")
print(f"Slimmed down size: {len(emotion_df_slim)}")

Original size: 416809
Slimmed down size: 20840


In [78]:
## dataset down-sampling

fraction = 0.1

go_emotions_df_slim = go_emotions_raw_df.sample(frac=fraction)

print(f"Original size: {len(go_emotions_raw_df)}")
print(f"Slimmed down size: {len(go_emotions_df_slim)}")

Original size: 211225
Slimmed down size: 21122


In [82]:
emotion_df_slim.to_csv('../exports/emotion_slim.csv')
go_emotions_df_slim.to_csv('../exports/go_emotions_slim.csv')

------------------

### 3. Explore the data
1. Create a copy of the data for explorations (sampling it down to a manageable size if necessary)
2. Create a Jupyter notebook to keep a record of your data exploration
3. Study each feature and its characteristics:
    * Name
    * Type (categorical, int/float, bounded/unbounded, text, structured, etc)
    * Percentage of missing values
    * Check for outliers, rounding errors etc
4. For supervised learning tasks, identify the target(s)
5. Visualise the data
6. Study the correlations between features
7. Identify the promising transformations you may want to apply (e.g. convert skewed targets to normal via a log transformation)
8. Document what you have learned

In [76]:
go_emotions_raw_df.head()

Unnamed: 0,text,id,author,subreddit,link_id,parent_id,created_utc,rater_id,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,disapproval,disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,eew5j0j,Brdd9,nrl,t3_ajis4z,t1_eew18eq,1548381000.0,1,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,>sexuality shouldn’t be a grouping category I...,eemcysk,TheGreen888,unpopularopinion,t3_ai4q37,t3_ai4q37,1548084000.0,37,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"You do right, if you don't care then fuck 'em!",ed2mah1,Labalool,confessions,t3_abru74,t1_ed2m7g7,1546428000.0,37,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,eeibobj,MrsRobertshaw,facepalm,t3_ahulml,t3_ahulml,1547965000.0,18,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",eda6yn6,American_Fascist713,starwarsspeculation,t3_ackt2f,t1_eda65q2,1546669000.0,2,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [50]:
go_emotions_cleaned_df = go_emotions_raw_df.copy()

go_emotions_cleaned_df.drop(columns=['created_utc', 
                                 'rater_id', 
                                 'id', 
                                 'author', 
                                 'subreddit', 
                                 'link_id',
                                 'parent_id'], 
                        inplace=True)

In [52]:
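# Keep only examples that raters did not flag as very unclear, then drop the flag column.
# Reassigning instead of using inplace=True avoids modifying a filtered view of the frame.
go_emotions_cleaned_df = go_emotions_cleaned_df[~go_emotions_cleaned_df['example_very_unclear']]
go_emotions_cleaned_df = go_emotions_cleaned_df.drop(columns=['example_very_unclear'])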

In [54]:
pd.set_option('display.max_columns', None)
go_emotions_cleaned_df.head()

Unnamed: 0,text,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,disapproval,disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,>sexuality shouldn’t be a grouping category I...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,"You do right, if you don't care then fuck 'em!",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [None]:
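# Count how often each emotion label is set in go_emotions.
# The cleaned frame has one 0/1 column per emotion (and no 'dataset' or 'label' column),
# so summing each emotion column gives its frequency.
label_counts = go_emotions_cleaned_df.drop(columns=['text']).sum().sort_values(ascending=False)

label_counts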



The go_emotions dataset has these emotion labels:
- admiration
- amusement
- anger
- annoyance
- approval
- caring
- confusion
- curiosity
- desire
- disappointment
- disapproval
- disgust
- embarrassment
- excitement
- fear
- gratitude
- grief
- joy
- love
- nervousness
- optimism
- pride
- realization
- relief
- remorse
- sadness
- surprise
- neutral

The emotion dataset uses these six labels:
- 0: 'sadness'
- 1: 'joy'
- 2: 'love'
- 3: 'anger'
- 4: 'fear'
- 5: 'surprise'

##### Possible mapping from go_emotions to the six shorter emotion categories

- 0: 'sadness'
    * disappointment
    * grief
    * remorse
    * sadness
- 1: 'joy'
    * amusement
    * excitement
    * joy
    * optimism
- 2: 'love'
    * caring
    * love
- 3: 'anger'
    * anger
    * annoyance
    * disapproval (could be related to anger)
    * disgust (could be related to anger)
- 4: 'fear'
    * fear
    * nervousness (could be related to fear)
- 5: 'surprise'
    * surprise

Not categorized:
- admiration
- approval
- confusion
- curiosity
- desire
- embarrassment
- gratitude
- pride
- realization
- relief
- neutral
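
The mapping above can be written down as a lookup table and applied to one row at a time. The following is only a sketch under our own assumptions: the names `GO_TO_EMOTION`, `COLUMN_TO_LABEL` and `collapse_row` are hypothetical, and rows whose set columns fall into more than one of the six categories (or into none) are returned as `None` rather than guessed.

```python
# Mapping taken from the grouping above; keys are the six emotion label ids.
GO_TO_EMOTION = {
    0: ["disappointment", "grief", "remorse", "sadness"],
    1: ["amusement", "excitement", "joy", "optimism"],
    2: ["caring", "love"],
    3: ["anger", "annoyance", "disapproval", "disgust"],
    4: ["fear", "nervousness"],
    5: ["surprise"],
}

# Reverse lookup: go_emotions column name -> emotion label id
COLUMN_TO_LABEL = {
    column: label
    for label, columns in GO_TO_EMOTION.items()
    for column in columns
}


def collapse_row(row):
    """Return the single emotion id for a one-hot row, or None.

    `row` maps column names to values. None is returned when no mapped
    column is set, or when the set columns point to different categories
    (assumption: ambiguous rows are skipped, not resolved).
    """
    labels = {
        COLUMN_TO_LABEL[column]
        for column, value in row.items()
        if column in COLUMN_TO_LABEL and value == 1
    }
    return labels.pop() if len(labels) == 1 else None


print(collapse_row({"text": "That game hurt.", "sadness": 1, "neutral": 0}))
print(collapse_row({"text": "Man I love reddit.", "love": 1}))
print(collapse_row({"text": "unmapped example", "neutral": 1}))
```

With a DataFrame, one possible way to use it is `go_emotions_cleaned_df.apply(lambda r: collapse_row(r.to_dict()), axis=1)`, followed by dropping the rows that come back empty.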

[Separate notebook with this step - ./3-data-exploration.ipynb](./3-data-exploration.ipynb)

------------------

### 4. Prepare the data
Notes:
* Work on copies of the data (keep the original dataset intact).
* Write functions for all data transformations you apply, for three reasons:
    * So you can easily prepare the data the next time you run your code
    * So you can apply these transformations in future projects
    * To clean and prepare the test set
    
    
1. Data cleaning:
    * Fix or remove outliers (or keep them)
    * Fill in missing values (e.g. with zero, mean, median, regression ...) or drop their rows (or columns)
2. Feature selection (optional):
    * Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).
3. Feature engineering, where appropriate:
    * Discretize continuous features
    * Use one-hot encoding if/when relevant
    * Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
    * Aggregate features into promising new features
4. Feature scaling: standardise or normalise features
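
For text data, "write functions for all data transformations" could look like the sketch below. The specific cleaning steps (lowercasing, removing URLs, collapsing whitespace) and the function names are assumptions chosen for illustration; the actual steps used in this project are in the separate notebook.

```python
import re

# Patterns are compiled once so the function stays cheap on large datasets
URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase a post, remove URLs, and collapse repeated whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def prepare_texts(texts):
    """Apply clean_text to every post, keeping order and length unchanged."""
    return [clean_text(text) for text in texts]


# The same function can be reused for the training set, the test set, and new posts
print(prepare_texts(["Man I   love reddit.", "See https://example.com NOW"]))
```

Keeping the output the same length as the input means the cleaned texts stay aligned with their labels.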

[Separate notebook with this step - ./4-preparing-data.ipynb](./4-preparing-data.ipynb)

------------------

### 5. Short-list promising models
We expect you to do some additional research and train at **least one model per team member**.

1. Train mainly quick and dirty models from different categories (e.g. linear, SVM, Random Forests etc) using default parameters
2. Measure and compare their performance
3. Analyse the most significant variables for each algorithm
4. Analyse the types of errors the models make
5. Have a quick round of feature selection and engineering if necessary
6. Have one or two more quick iterations of the five previous steps
7. Short-list the top three to five most promising models, preferring models that make different types of errors
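
As a rough illustration of step 1 and step 2 for a text task, the sketch below trains a few quick models from different families with default parameters on TF-IDF features and compares their macro F1. It assumes scikit-learn is installed; the choice of models, the TF-IDF features, and the tiny toy dataset are our own assumptions, and the real experiments are in the separate notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, made up for illustration: 0 = sadness, 1 = joy
train_texts = [
    "i feel awful and alone",
    "i was feeling a little low",
    "i feel so sad today",
    "i am feeling miserable",
    "i feel happy and excited",
    "i am feeling great today",
    "i feel wonderful and joyful",
    "i am so glad and cheerful",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_texts = ["i feel low and sad", "i feel great and happy"]
test_labels = [0, 1]

# One quick model per family, all with default parameters
candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

scores = {}
for name, model in candidates.items():
    # The pipeline fits the vectorizer on the training texts only
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(train_texts, train_labels)
    predictions = pipeline.predict(test_texts)
    scores[name] = f1_score(test_labels, predictions,
                            average="macro", zero_division=0)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: macro F1 = {score:.2f}")
```

Wrapping the vectorizer and the model in one pipeline keeps the comparison fair, because every candidate sees exactly the same preprocessing.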

[Separate notebook with this step - ./5-training-models.ipynb](./5-training-models.ipynb)

------------------

### 6. Fine-tune the system
1. Fine-tune the hyperparameters
2. Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error

------------------

### 7. Present your solution
1. Document what you have done
2. Create a nice 15-minute video presentation with slides
    * Make sure you highlight the big picture first
3. Explain why your solution achieves the business objective
4. Don't forget to present interesting points you noticed along the way:
    * Describe what worked and what did not
    * List your assumptions and your model's limitations
5. Ensure your key findings are communicated through nice visualisations or easy-to-remember statements (e.g. "the median income is the number-one predictor of housing prices")
6. Upload the presentation to some online platform, e.g. YouTube or Vimeo, and supply a link to the video in the notebook.

Géron, A. 2017, *Hands-On Machine Learning with Scikit-Learn and Tensorflow*, Appendix B, O'Reilly Media, Inc., Sebastopol.