### USECASE_1 : sentiment analysis of comments


- **Source:** `sentiment_analysis_reviews_0.csv`, a sample of review comments (the complete dataset has 568,454 rows).
- **Objective:** run a sentiment analysis on all of these comments, i.e. use an ML model to predict whether each comment is negative or positive.


- **Note:** For educational purposes and due to limitations on GitHub, I have reduced the dataset to 1,000 rows. The complete dataset comes from https://www.kaggle.com/code/robikscube/sentiment-analysis-python-youtube-tutorial/input

```text
# the complete dataset
[568454 rows x 10 columns]

# columns
'Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'
```
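The reduction to 1,000 rows mentioned in the note can be sketched as follows. This is a minimal sketch, not necessarily how the sample file was actually produced: the function name `make_sample` and its parameters are hypothetical, and it assumes the sample simply keeps the first rows of the complete file.

```python
import pandas as pd


def make_sample(full_csv, sample_csv, n_rows=1000):
    """Keep the first n_rows rows of full_csv and write them to sample_csv.

    Hypothetical helper: the names and the "first rows" strategy are assumptions.
    """
    # nrows makes pandas stop reading after n_rows data rows,
    # so the complete file is never fully loaded into memory
    sample = pd.read_csv(full_csv, nrows=n_rows)
    # index=False avoids writing the DataFrame index as an extra column
    sample.to_csv(sample_csv, index=False)
    return sample
```

Reading with `nrows` keeps the memory footprint small; a random sample (for example with `DataFrame.sample`) would be an alternative if the first rows are not representative.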




### 1. Data Collection

In [1]:
# DATA
import numpy as np
import pandas as pd

In [3]:
CSV_SOURCE="data_source/sentiment_analysis_reviews_0.csv"
# Load the dataframe from CSV
df = pd.read_csv(CSV_SOURCE)

# show content
print(df)

# show columns
print(df.columns)

       Id   ProductId          UserId                      ProfileName  \
0       1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1       2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2       3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3       4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4       5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   
..    ...         ...             ...                              ...   
995   996  B006F2NYI2  A1D3F6UI1RTXO0                           Swopes   
996   997  B006F2NYI2   AF50D40Y85TV3                          Mike A.   
997   998  B006F2NYI2  A3G313KLWDG3PW                          kefka82   
998   999  B006F2NYI2  A3NIDDT7E7JIFW                  V. B. Brookshaw   
999  1000  B006F2NYI2  A132DJVI37RB4X                        Scottdrum   

     HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                       1              

### 2. Data Preparation

**During the preparation phase, it is often necessary to reduce the size of the files. Here, the sample is split into smaller CSV files; everything is explained in the file `001_split_files.py`.**

In [4]:
#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
[env]
# Conda Environment
conda create --name sentiment_analysis python=3.9.13
conda info --envs
conda activate sentiment_analysis
conda deactivate

# if needed to remove
conda env remove -n [NAME_OF_THE_CONDA_ENVIRONMENT]

# update conda 
conda update -n base -c defaults conda

# to export requirements
pip freeze > requirements.txt

# to install
pip install -r requirements.txt


# [path]
cd /Users/brunoflaven/Documents/01_work/blog_articles/ia_llms_usecases/usecase_1_sentiment_analysis/

# LAUNCH the file
python 001_split_files.py


[install]
python -m pip install transformers
python -m pip install pyarrow
python -m pip install pandas
python -m pip install numpy
python -m pip install tensorflow
python -m pip install sentencepiece

[source]
# multilingual
https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student

# french
https://huggingface.co/cmarkea/distilcamembert-base-sentiment

The dataset comprises 204,993 reviews for training and 4,999 reviews for testing from Amazon, and 235,516 and 4,729 reviews from the Allocine website. The dataset is labeled into five categories:

1 star: terrible rating,
2 stars: bad rating,
3 stars: neutral rating,
4 stars: good rating,
5 stars: excellent rating.

"""

# DATA
from pathlib import Path

import pandas as pd

##### VALUES
CSV_SOURCE = "data_source/sentiment_analysis_reviews_0.csv"  # input file (1000 rows)
SPLIT_DIR = Path("data_split")  # output directory for the split files

# Load the source CSV into a DataFrame and display it
data = pd.read_csv(CSV_SOURCE)
print(data)

# Number of CSV files to create and number of rows per file (50 * 20 = 1000 rows)
k = 50
size = 20

# Create the output directory if it does not exist yet
SPLIT_DIR.mkdir(parents=True, exist_ok=True)

# Write rows [size*i, size*(i+1)) to the file number i+1
for i in range(k):
    df = data.iloc[size * i:size * (i + 1)]
    out_file = SPLIT_DIR / f"sentiment_analysis_reviews_sample_{i + 1}.csv"
    df.to_csv(out_file, index=False)
    print(f"the file {out_file} has been created")

print('\n--- DONE')





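The script above hardcodes 50 files of 20 rows, which matches the 1,000-row sample. As a hedged variation, the same slicing logic can be wrapped in a function that derives the number of files from the DataFrame length, so that a final, shorter chunk is not lost. The function name `split_dataframe` and its parameters are hypothetical and not part of `001_split_files.py`.

```python
import math
from pathlib import Path

import pandas as pd


def split_dataframe(data, out_dir, size=20,
                    prefix="sentiment_analysis_reviews_sample"):
    """Write `data` to out_dir as CSV files of at most `size` rows each.

    Hypothetical helper; returns the list of files that were written.
    """
    out_path = Path(out_dir)
    # Create the destination directory (and parents) if it is missing
    out_path.mkdir(parents=True, exist_ok=True)

    # Round up so that a last, incomplete chunk still gets its own file
    n_files = math.ceil(len(data) / size)

    written = []
    for i in range(n_files):
        # Positional slice: rows [size*i, size*(i+1))
        chunk = data.iloc[size * i:size * (i + 1)]
        target = out_path / f"{prefix}_{i + 1}.csv"
        chunk.to_csv(target, index=False)
        written.append(target)
    return written
```

With the 1,000-row sample and `size=20`, this would produce the same 50 files as the loop in the script.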

### 3. Feature Engineering and Modelling

**In the modelling phase, you often have to select the best model and iterate to check that the predictions are good. Sentiment analysis is a well-known problem; the only difficulty is finding a model that has been trained on the target language, in this case French. Everything is explained in the file `002_sentiment_analysis.py`.**
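As an illustration of this step, here is a minimal sketch built on the `transformers` pipeline API and the French model listed in the sources (`cmarkea/distilcamembert-base-sentiment`). It is not the content of `002_sentiment_analysis.py`. Several points are assumptions: the helper names (`load_classifier`, `stars_to_sentiment`, `analyze`) are hypothetical, the model is assumed to return labels of the form `1 star` to `5 stars`, and the grouping of stars into negative/neutral/positive is one possible choice among others.

```python
def load_classifier(model_name="cmarkea/distilcamembert-base-sentiment"):
    """Build a Hugging Face text-classification pipeline (downloads the model)."""
    # Imported lazily because transformers is a heavy dependency
    from transformers import pipeline
    return pipeline("text-classification", model=model_name, tokenizer=model_name)


def stars_to_sentiment(label):
    """Map a label such as '1 star' or '5 stars' to a coarse sentiment.

    Assumption: the label starts with the number of stars.
    """
    stars = int(label.split()[0])
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def analyze(texts, classifier):
    """Classify each text and attach a negative/neutral/positive sentiment."""
    texts = list(texts)
    predictions = classifier(texts)  # one {'label': ..., 'score': ...} per text
    results = []
    for text, pred in zip(texts, predictions):
        results.append({
            "text": text,
            "label": pred["label"],
            "score": pred["score"],
            "sentiment": stars_to_sentiment(pred["label"]),
        })
    return results


# Usage (requires network access to download the model):
# clf = load_classifier()
# print(analyze(["Ce produit est excellent"], clf))
```

Keeping the classifier as a parameter of `analyze` makes it easy to swap in another model, such as the multilingual one listed in the sources, provided the label mapping is adapted to that model's output.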

### CAUTION: you must ensure that the source .csv files exist in the correct directory and that the destination directory exists, that is to say, that the project layout is the same as the one the scripts expect
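This check can be automated before running the scripts. The following is a minimal sketch under stated assumptions: `check_layout` is a hypothetical helper, and the paths in the usage comment are the ones used by the script in this document.

```python
from pathlib import Path


def check_layout(source_csv, dest_dir, create=False):
    """Return a list of problems found with the expected project layout.

    Hypothetical helper: an empty list means the layout looks fine.
    """
    problems = []

    # The source CSV must already exist; it cannot be created for the user
    if not Path(source_csv).is_file():
        problems.append(f"missing source file: {source_csv}")

    # The destination directory can optionally be created on the fly
    dest = Path(dest_dir)
    if not dest.is_dir():
        if create:
            dest.mkdir(parents=True, exist_ok=True)
        else:
            problems.append(f"missing destination directory: {dest_dir}")

    return problems


# Usage:
# for problem in check_layout("data_source/sentiment_analysis_reviews_0.csv",
#                             "data_split"):
#     print(problem)
```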