### 6. Text preprocessing

#### 6.1. Text preprocessing with NLTK


- Lowercasing: convert all text to lowercase for consistency.
- Tokenization: split the text into individual words (tokens).
- Stop-word removal: drop very common words (e.g. "the", "and") that carry little meaning.
- Lemmatization: reduce words to their base (dictionary) form to normalize variations.
- Removing special characters and numbers.
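
To make the steps concrete, here is a dependency-free toy version applied to a single sentence. It is only a sketch: the stop-word set `TOY_STOP_WORDS` and the regex tokenizer are hypothetical stand-ins for NLTK's stop-word list and `word_tokenize`, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical mini stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"a", "an", "the", "in", "of", "is", "and", "to"}

def toy_preprocess(text):
    # 1. Lowercase
    text = text.lower()
    # 2. Tokenize with a naive regex (stand-in for nltk's word_tokenize);
    #    hyphenated words such as "near-future" stay as one token
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    # 3. Remove stop words
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 4. Drop tokens that contain digits
    tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    return " ".join(tokens)

print(toy_preprocess("Alex, a teenager living in near-future England in 1962."))
# alex teenager living near-future england
```

The notebook's actual function, defined further down, performs the same steps with NLTK and adds lemmatization.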

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context('poster')
sns.set(rc={'figure.figsize': (16., 9.)})
sns.set_style('whitegrid')
import numpy as np
import re
import os

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [3]:
five = pd.read_csv("/Users/usuari/Desktop/Ironhack/BOOTCAMP/projects/final_project/data/five.csv")
five.head(3)

Unnamed: 0,title,summary,genre
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction
1,The Plague,The text of The Plague is divided into five p...,literary fiction
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction


In [4]:
five.shape

(11013, 3)

In [5]:
leme = five.copy()
leme.head(2)

Unnamed: 0,title,summary,genre
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction
1,The Plague,The text of The Plague is divided into five p...,literary fiction


In [6]:
# Define a function that adds a new column with the preprocessed text. This is the last step before training the model.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Optional: NLTK data can be stored in a custom directory by passing
# download_dir="/path/to/your/nltk_data" to nltk.download() and appending
# the same path to nltk.data.path. The default location is used here.

# Initialize NLTK resources outside the function
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocessing_8(leme):
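    # Create the output column (it is filled in after the loop).
    # Note: missing summaries are not handled; a NaN would raise an error below.
    leme['cleaned_summary'] = ''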

    # Process each row in the DataFrame
    cleaned_summaries = []

    for index, row in leme.iterrows():
        # Preprocess the text
        text = row['summary'].lower()
        tokens = word_tokenize(text)

        # Drop stop words; keep a hyphenated token only if it is alphabetic once the hyphens are removed
        filtered_tokens = [word for word in tokens if word.lower() not in stop_words and ('-' not in word or word.replace('-', '').isalpha())]

        lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
        # Replace every character other than ASCII letters and hyphens with a space,
        # then split on whitespace (non-ASCII letters such as "ä" are replaced too)
        clean_tokens = [re.sub(r'[^a-zA-Z-]', ' ', word).split() for word in lemmatized_tokens]

        cleaned_summary = ' '.join(' '.join(words) for words in clean_tokens)
        cleaned_summaries.append(cleaned_summary)

    # Update the DataFrame column outside the loop
    leme['cleaned_summary'] = cleaned_summaries
    leme['cleaned_summary'] = leme['cleaned_summary'].astype(str)

    return leme


[nltk_data] Downloading package punkt to /Users/usuari/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/usuari/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/usuari/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
# Run the function

preprocessing_8(leme)

Unnamed: 0,title,summary,genre,cleaned_summary
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near-future england lea...
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thou...
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul b umer german soldier wh...
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north ar...
4,Blade Runner 3: Replicant Night,"Living on Mars, Deckard is acting as a consul...",science fiction,living mar deckard acting consultant movie cr...
...,...,...,...,...
11008,Hounded,"Atticus O’Sullivan, last of the Druids, lives ...",fantasy,atticus sullivan last druid life peacefully...
11009,Charlie and the Chocolate Factory,Charlie Bucket's wonderful adventure begins wh...,fantasy,charlie bucket s wonderful adventure begin fin...
11010,Red Rising,"""I live for the dream that my children will be...",fantasy,live dream child born free say like land...
11011,Frostbite,"Rose loves Dimitri, Dimitri might love Tasha, ...",fantasy,rose love dimitri dimitri might love tasha m...
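
Row 2 above shows "Bäumer" turned into `b umer`. This happens because the character filter `[^a-zA-Z-]` treats the non-ASCII letter "ä" as a special character. One possible workaround, sketched below with the standard library only, is to strip accents before filtering. The helper name `strip_accents` is hypothetical and this step is not part of the notebook's pipeline.

```python
import re
import unicodedata

def strip_accents(text):
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks ("ä" -> "a")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "bäumer"

# Current behaviour: the accented letter becomes a space and splits the word
print(re.sub(r"[^a-zA-Z-]", " ", word).split())                  # ['b', 'umer']

# With accents stripped first, the word stays in one piece
print(re.sub(r"[^a-zA-Z-]", " ", strip_accents(word)).split())   # ['baumer']
```

Whether "baumer" is preferable to keeping "bäumer" intact depends on the downstream model, so this is a design choice rather than a fix.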


In [8]:
# I replace the "-" with a space to separate two tokens.
hyphen = leme.copy()
hyphen['cleaned_summary'] = hyphen['cleaned_summary'].str.replace('-', ' ')
hyphen.head(4)

Unnamed: 0,title,summary,genre,cleaned_summary
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lea...
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thou...
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul b umer german soldier wh...
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north ar...


In [9]:
# I drop the tokens that only have one letter. 

from nltk.tokenize import word_tokenize
nltk.download('punkt')

hyphen['cleaned_summary'] = hyphen['cleaned_summary'].apply(lambda text: ' '.join(word for word in word_tokenize(text.lower()) if len(word) > 1))
hyphen.head(4)

[nltk_data] Downloading package punkt to /Users/usuari/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,title,summary,genre,cleaned_summary
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lead ...
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thousa...
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul umer german soldier who u...
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north arc...


In [10]:
# Save the DataFrame with the changes:

hyphen.to_csv("hyphen.csv", index=False)

# Specify the folder path and filename for the CSV file
folder_path = "/Users/usuari/Desktop/Ironhack/BOOTCAMP/projects/final_project/data"
file_name = "hyphen.csv"

# Combine the folder path and filename to create the full file path
full_file_path = f"{folder_path}/{file_name}"

# Export the DataFrame to the specified folder
hyphen.to_csv(full_file_path, index=False)

#### 6.2. Descriptive analysis of the dataset through NLP

1) NER: named entity recognition with a small spaCy model.

2) NER: named entity recognition with a larger spaCy model.

3) Add a column with the length of each summary, so that I can group by genre and plot a histogram for each one.

4) Add a column with the number of unique words in each summary, so that I can check which genre has the most unique words (probably fantasy).

5) Top 5 or 10 words of each genre (most representative words) + word clouds.
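
Items 3 to 5 can be sketched with the standard library before moving to pandas. The rows below are made-up `(genre, cleaned_summary)` pairs and the function name `describe_by_genre` is hypothetical. The sketch computes the average summary length, the average number of unique words, and the most frequent words per genre.

```python
from collections import Counter, defaultdict

# Hypothetical rows standing in for (genre, cleaned_summary) pairs
rows = [
    ("fantasy", "ged young boy gont island wizard boy"),
    ("fantasy", "dragon wizard island"),
    ("science fiction", "deckard replicant mars replicant"),
]

def describe_by_genre(rows, top_n=2):
    # genre -> list of token lists, one per summary
    tokens_by_genre = defaultdict(list)
    for genre, text in rows:
        tokens_by_genre[genre].append(text.split())

    stats = {}
    for genre, docs in tokens_by_genre.items():
        counts = Counter(tok for doc in docs for tok in doc)
        stats[genre] = {
            "avg_len": sum(len(d) for d in docs) / len(docs),
            "avg_unique": sum(len(set(d)) for d in docs) / len(docs),
            "top_words": [w for w, _ in counts.most_common(top_n)],
        }
    return stats

for genre, values in describe_by_genre(rows).items():
    print(genre, values)
```

On the real DataFrame the same numbers would come from a token-count column grouped by `genre`.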



##### 6.2.1. NER with a simple spaCy model for the English language (small)

In [12]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.7.0
    Uninstalling en-core-web-sm-3.7.0:
      Successfully uninstalled en-core-web-sm-3.7.0
Successfully installed en-core-web-sm-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [15]:
# Define a function that adds a new column with the named entities found in each summary.

import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    # Process the text with spaCy
    doc = nlp(text)
    
    # Extract entities
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    
    return entities

def extract_entities_from_column(hyphen, column_name='cleaned_summary'):
    
    # Apply the extract_entities function to each row in the specified column
    hyphen['entities'] = hyphen[column_name].apply(extract_entities)

    return hyphen

In [17]:
extract_entities_from_column(hyphen, column_name='cleaned_summary')

Unnamed: 0,title,summary,genre,cleaned_summary,entities
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lead ...,"[(alex teenager, PERSON), (future england, FAC..."
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thousa...,"[(five, CARDINAL), (three day, DATE), (quarant..."
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul umer german soldier who u...,"[(paul, PERSON), (german, NORP), (german, NORP..."
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north arc...,"[(one, CARDINAL), (one day, DATE), (apprentice..."
4,Blade Runner 3: Replicant Night,"Living on Mars, Deckard is acting as a consul...",science fiction,living mar deckard acting consultant movie cre...,[]
...,...,...,...,...,...
11008,Hounded,"Atticus O’Sullivan, last of the Druids, lives ...",fantasy,atticus sullivan last druid life peacefully ar...,"[(atticus sullivan last druid, PERSON), (arizo..."
11009,Charlie and the Chocolate Factory,Charlie Bucket's wonderful adventure begins wh...,fantasy,charlie bucket wonderful adventure begin find ...,"[(charlie bucket, PERSON)]"
11010,Red Rising,"""I live for the dream that my children will be...",fantasy,live dream child born free say like land fathe...,"[(one, CARDINAL), (darrow, DATE)]"
11011,Frostbite,"Rose loves Dimitri, Dimitri might love Tasha, ...",fantasy,rose love dimitri dimitri might love tasha mas...,"[(dimitri dimitri, PERSON), (tasha mason, ORG)..."


In [18]:
# Save the DataFrame with the changes:

hyphen.to_csv("entities.csv", index=False)

# Specify the folder path and filename for the CSV file
folder_path = "/Users/usuari/Desktop/Ironhack/BOOTCAMP/projects/final_project/data"
file_name = "entities.csv"

# Combine the folder path and filename to create the full file path
full_file_path = f"{folder_path}/{file_name}"

# Export the DataFrame to the specified folder
hyphen.to_csv(full_file_path, index=False)
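
One thing to keep in mind when reloading `entities.csv`: a CSV cell can only hold text, so each list of tuples is stored as its string representation. The sketch below uses only the standard library and a made-up two-entity row to show the round trip, and how `ast.literal_eval` turns the string back into a list. It assumes the column will be read back later, a step the notebook does not show.

```python
import ast
import csv
import io

# Made-up example of one row's entity list
entities = [("alex", "PERSON"), ("next day", "DATE")]

# Write one row to an in-memory CSV; the list is stored as str(entities)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "entities"])
writer.writerow(["A Clockwork Orange", entities])

# Read it back: the cell is now a plain string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["entities"]).__name__)   # str

# literal_eval safely parses the string back into a list of tuples
restored = ast.literal_eval(row["entities"])
print(restored == entities)             # True
```

The same conversion should apply to the `entities` column after reading the file back with pandas, for example by applying `ast.literal_eval` to each cell.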

In [23]:
list(hyphen.entities)[0]

[('alex teenager', 'PERSON'),
 ('future england', 'FAC'),
 ('russian', 'NORP'),
 ('dim slow', 'PERSON'),
 ('second', 'ORDINAL'),
 ('alex', 'PERSON'),
 ('ludwig van novel', 'PERSON'),
 ('korova', 'PERSON'),
 ('alex skip', 'PERSON'),
 ('next day', 'DATE'),
 ('alex meet', 'PERSON'),
 ('ten year old', 'DATE'),
 ('alex', 'PERSON'),
 ('alex leadership', 'PERSON'),
 ('alex quells', 'PERSON'),
 ('alex attack', 'PERSON'),
 ('alex', 'PERSON'),
 ('alex', 'PERSON'),
 ('alex fellow', 'PERSON'),
 ('one', 'CARDINAL'),
 ('fifth', 'ORDINAL'),
 ('alex collapse', 'PERSON'),
 ('alex free', 'PERSON'),
 ('alex released', 'PERSON'),
 ('alex wanders', 'PERSON'),
 ('alex help friend', 'PERSON'),
 ('alex', 'PERSON'),
 ('two', 'CARDINAL'),
 ('alex', 'PERSON'),
 ('alex collapse', 'PERSON'),
 ('first', 'ORDINAL'),
 ('half', 'CARDINAL'),
 ('alex writer', 'PERSON'),
 ('alex question', 'PERSON'),
 ('alex symbol state', 'PERSON'),
 ('alex role happening', 'PERSON'),
 ('two year ago', 'DATE'),
 ('one', 'CARDINAL'),
 ('

##### 6.2.2. NER with a more complex spaCy model for the English language (medium)

In [24]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [25]:
# Define a function that adds a new column with the named entities found in each summary.

# Load the spaCy English language model
nlp = spacy.load("en_core_web_md")

def extract_entities(text):
    # Process the text with spaCy
    doc = nlp(text)
    
    # Extract entities
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    
    return entities

def extract_entities_2(df, column_name='cleaned_summary'):
    # Apply the extract_entities function to each row in the specified column
    df['accurate_entities'] = df[column_name].apply(extract_entities)

    return df

In [26]:
extract_entities_2(hyphen)

Unnamed: 0,title,summary,genre,cleaned_summary,entities,accurate_entities
0,A Clockwork Orange,"Alex, a teenager living in near-future Englan...",science fiction,alex teenager living near future england lead ...,"[(alex teenager, PERSON), (future england, FAC...","[(alex teenager, PERSON), (future england lead..."
1,The Plague,The text of The Plague is divided into five p...,literary fiction,text plague divided five part town oran thousa...,"[(five, CARDINAL), (three day, DATE), (quarant...","[(five, CARDINAL), (dr bernard rieux life, PER..."
2,All Quiet on the Western Front,"The book tells the story of Paul Bäumer, a Ge...",literary fiction,book tell story paul umer german soldier who u...,"[(paul, PERSON), (german, NORP), (german, NORP...","[(paul umer, PERSON), (german, NORP), (german,..."
3,A Wizard of Earthsea,"Ged is a young boy on Gont, one of the larger...",fantasy,ged young boy gont one larger island north arc...,"[(one, CARDINAL), (one day, DATE), (apprentice...","[(one, CARDINAL), (one, CARDINAL), (one, CARDI..."
4,Blade Runner 3: Replicant Night,"Living on Mars, Deckard is acting as a consul...",science fiction,living mar deckard acting consultant movie cre...,[],"[(mar deckard, PERSON)]"
...,...,...,...,...,...,...
11008,Hounded,"Atticus O’Sullivan, last of the Druids, lives ...",fantasy,atticus sullivan last druid life peacefully ar...,"[(atticus sullivan last druid, PERSON), (arizo...","[(atticus sullivan, PERSON), (arizona, GPE), (..."
11009,Charlie and the Chocolate Factory,Charlie Bucket's wonderful adventure begins wh...,fantasy,charlie bucket wonderful adventure begin find ...,"[(charlie bucket, PERSON)]","[(charlie bucket, PERSON), (wonka, PERSON), (w..."
11010,Red Rising,"""I live for the dream that my children will be...",fantasy,live dream child born free say like land fathe...,"[(one, CARDINAL), (darrow, DATE)]","[(darrow, GPE), (one day, DATE), (darrow, PERS..."
11011,Frostbite,"Rose loves Dimitri, Dimitri might love Tasha, ...",fantasy,rose love dimitri dimitri might love tasha mas...,"[(dimitri dimitri, PERSON), (tasha mason, ORG)...","[(dimitri dimitri, PERSON), (tasha mason, PERS..."


##### 6.2.3. NER with an even more complex spaCy model for the English language (large)

- The larger models tend to offer better accuracy but may require more computational resources.

In [28]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [29]:
# Define a function that adds a new column with the named entities found in each summary.

# Load the spaCy English language model
nlp = spacy.load("en_core_web_lg")

def extract_entities_3(df, column_name='cleaned_summary'):
    # Apply the extract_entities function to each row in the specified column
    df['more_accurate_entities'] = df[column_name].apply(extract_entities)

    return df

In [30]:
extract_entities_3(hyphen)

KeyboardInterrupt: 
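
The run with the large model was interrupted. One option that may shorten such runs is spaCy's `nlp.pipe`, which processes texts as a stream in batches instead of making one call per row. A minimal sketch, assuming `nlp` is any loaded spaCy pipeline; the function name `extract_entities_batched` and the batch size are hypothetical, and no timing was measured here.

```python
def extract_entities_batched(nlp, texts, batch_size=64):
    """Return one list of (text, label) tuples per input text."""
    results = []
    # nlp.pipe yields one Doc per input text, in the same order
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(ent.text, ent.label_) for ent in doc.ents])
    return results

# Hypothetical usage with the notebook's objects (not run here):
# nlp = spacy.load("en_core_web_lg")
# texts = hyphen["cleaned_summary"].tolist()
# hyphen["more_accurate_entities"] = extract_entities_batched(nlp, texts)
```

Because the result is a plain list in row order, it can be assigned directly as a new column.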

#### 6.3. MySQL Workbench

- Upload the final dataset to MySQL Workbench.
- Run some queries to support the descriptive analysis of the dataset:
    - the average number of tokens per genre
    - the average number of entities per genre
    - the total (or average) number of words per genre
    - the total (or average) number of unique words per genre
    - the different entity categories and the count per category for each genre
    - the count of each type of entity per genre (MySQL Workbench)
    - the total count of entities per genre
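
Before writing the SQL, the entity aggregates can be sanity-checked in Python so that the query results have something to be compared against. The sketch below counts entity labels per genre on made-up rows. The function name `entity_label_counts` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows standing in for (genre, entities) pairs
rows = [
    ("fantasy", [("charlie bucket", "PERSON"), ("wonka", "PERSON")]),
    ("fantasy", [("arizona", "GPE")]),
    ("science fiction", [("alex", "PERSON"), ("next day", "DATE")]),
]

def entity_label_counts(rows):
    # genre -> Counter of entity labels
    counts = defaultdict(Counter)
    for genre, entities in rows:
        counts[genre].update(label for _, label in entities)
    return counts

counts = entity_label_counts(rows)
for genre, label_counts in counts.items():
    # Per-label counts and the total number of entities for the genre
    print(genre, dict(label_counts), sum(label_counts.values()))
```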