# Vanessa Williams
# Milestone 3

### Import the necessary libraries and load the dataset

In [None]:
import re
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
import ssl

# Bypass SSL verification to download stopwords
ssl._create_default_https_context = ssl._create_unverified_context

# Download NLTK stopwords
nltk.download('stopwords')

# Load your dataset
file_path = '/Users/vanessawilliams/Desktop/Vanessa_Williams/corpus.csv'
data = pd.read_csv(file_path)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/reneulloa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Preprocessing the text

In [None]:
# Sample a smaller portion of the dataset for testing (e.g., 500 rows)
sampled_data = data.sample(n=500, random_state=42)

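# Build the stopword set once instead of on every call
# Note: this is NLTK's English list, so French function words are not removed
stop_words = set(stopwords.words('english'))

# Define a function to clean the text
def clean_text(text):
    # Remove non-alphabetic characters
    # Note: this ASCII-only pattern also strips accented letters ("président" -> "prsident")
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in stop_words)
    return text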

In [None]:
# Apply the cleaning function to the text column
sampled_data['cleaned_text'] = sampled_data['text'].apply(lambda x: clean_text(str(x)))

# Display the cleaned text
print(sampled_data[['text', 'cleaned_text']].head())

                                                     text   
147298  Le président du conseil italien Enrico Letta, ...  \
264296  Une nouvelle fois, la Cour des comptes, dans s...   
328459  « Les collectivités locales ont en effet calcu...   
13102   La nouvelle loi sur l'immigration dans l'Arizo...   
355422  Editorial du « Monde ». Olivier Dussopt va ent...   

                                             cleaned_text  
147298  le prsident du conseil italien enrico letta le...  
264296  une nouvelle fois la cour des comptes dans son...  
328459  les collectivits locales ont en effet calcul l...  
13102   la nouvelle loi sur limmigration dans larizona...  
355422  editorial du monde olivier dussopt va entrepre...  


### Feature Extraction using TF-IDF Vectorizer

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=500, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(sampled_data['cleaned_text'])

# Display the TF-IDF feature matrix
print("TF-IDF Feature Matrix shape:", tfidf_matrix.shape)

TF-IDF Feature Matrix shape: (500, 500)


### Latent Dirichlet Allocation (LDA) for Topic Modeling

In [None]:
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_topics = lda_model.fit_transform(tfidf_matrix)

# Display the LDA Topics
print("LDA Topics shape:", lda_topics.shape)

LDA Topics shape: (500, 5)


### Clustering using KMeans

In [None]:
# Set n_init explicitly (10 is the default this run used) to avoid the FutureWarning
kmeans_model = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans_clusters = kmeans_model.fit_predict(tfidf_matrix)

# Display the cluster assignments
sampled_data['cluster'] = kmeans_clusters
print("KMeans Cluster assignments:\n", sampled_data[['cleaned_text', 'cluster']].head())

KMeans Cluster assignments:
                                              cleaned_text  cluster
147298  le prsident du conseil italien enrico letta le...        3
264296  une nouvelle fois la cour des comptes dans son...        3
328459  les collectivits locales ont en effet calcul l...        3
13102   la nouvelle loi sur limmigration dans larizona...        3
355422  editorial du monde olivier dussopt va entrepre...        3


### Analyzing each cluster's common terms

In [None]:
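# Compute these once rather than on every loop iteration
feature_names = tfidf_vectorizer.get_feature_names_out()
top_terms_idx = kmeans_model.cluster_centers_.argsort()[:, -10:]

# Terms print in ascending order of centroid weight, so the strongest term is last
for i in range(5):
    print(f"\nTop terms in Cluster {i}:")
    for idx in top_terms_idx[i]:
        print(feature_names[idx])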

# Save the result for later analysis
sampled_data.to_csv('/Users/vanessawilliams/Desktop/Vanessa_Williams/sample_processed.csv', index=False)


Top terms in Cluster 0:
des
comment
scne
femmes
lhistoire
conomique
mondiale
la
guerre
pourquoi

Top terms in Cluster 1:
que
pas
du
je
en
des
les
et
la
le

Top terms in Cluster 2:
trois
police
dfense
femmes
cinma
la
nos
abonns
article
rserv

Top terms in Cluster 3:
une
pour
dans
du
en
et
des
le
les
la

Top terms in Cluster 4:
dans
une
qui
des
du
les
en
et
le
la


### Text Data Processing and Feature Engineering Summary

#### 1. **Data Sampling and Preprocessing**:
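   - We started by loading a large dataset of text articles. Because of the dataset's size, we drew a random sample of 500 rows (with a fixed `random_state` for reproducibility) to keep processing fast.
   - We applied text cleaning steps where we:
     - Removed non-alphabetic characters. The pattern is ASCII-only, so accented French letters were dropped as well (for example, "président" became "prsident").
     - Converted all text to lowercase.
     - Removed common stopwords using NLTK's English stopword list. Because the corpus is French, most French function words ("le", "la", "les") remained in the text.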
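
The cleaned output above shows lost accents ("prsident", "collectivits") and fused elisions ("limmigration"). A minimal sketch of a French-aware variant follows, as one possible adjustment rather than the method used in this notebook. The name `clean_text_fr` and the small `FRENCH_STOP_WORDS` set are hypothetical and for illustration only; NLTK's `stopwords.words('french')` provides a fuller list. Unlike the original function, this version replaces punctuation with a space so that "l'immigration" splits into "l" and "immigration".

```python
import re

# Illustrative subset only (an assumption for this sketch);
# NLTK's stopwords.words('french') offers a fuller list.
FRENCH_STOP_WORDS = {
    "le", "la", "les", "l", "d", "du", "des", "de",
    "un", "une", "et", "en", "dans", "sur",
}

# Matches punctuation/symbols, digits, and underscores.
# In Python 3, \w is Unicode-aware, so accented letters are kept.
NON_LETTER = re.compile(r"[^\w\s]|[\d_]")

def clean_text_fr(text, stop_words=FRENCH_STOP_WORDS):
    """Lowercase, drop non-letters (keeping accents), and remove stopwords."""
    text = text.lower()
    # Replace with a space so elisions like "l'arizona" split into two tokens
    text = NON_LETTER.sub(" ", text)
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text_fr("Le président du conseil italien Enrico Letta"))
print(clean_text_fr("La nouvelle loi sur l'immigration dans l'Arizona"))
```

With this variant the accented words survive cleaning, at the cost of maintaining a French stopword list.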

#### 2. **TF-IDF (Term Frequency-Inverse Document Frequency) Feature Extraction**:
   - We transformed the cleaned text using TF-IDF, which weights each word by how often it appears in a document, discounted by how many documents in the corpus contain it.
   - The resulting matrix has shape (500, 500): each of the 500 sampled documents is represented by a vector of TF-IDF weights over the 500 highest-frequency terms (`max_features=500`).

#### 3. **Topic Modeling using LDA (Latent Dirichlet Allocation)**:
   - We applied LDA, a topic modeling algorithm, with 5 topics. The output has shape (500, 5): each document receives a distribution over the 5 topics rather than a single label, and the highest-weighted topic can be read as the document's dominant topic.

#### 4. **KMeans Clustering**:
   - We clustered the documents into 5 groups using the KMeans clustering algorithm, where each cluster contains documents that are similar to each other in terms of their word features.
   - For each cluster, we displayed the 10 highest-weighted terms of its centroid, which hint at the cluster's main themes. For example, Cluster 0 contains terms related to history and war ("lhistoire", "guerre", "mondiale"), while Clusters 1, 3, and 4 are dominated by French function words such as "je", "que", "le", and "la".
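
As a sketch of the top-term lookup, the hypothetical helper `top_terms_per_cluster` below ranks terms for every centroid in one vectorized step and returns them strongest first. The toy `centers` array and `terms` list are assumptions standing in for `kmeans_model.cluster_centers_` and the vectorizer's feature names.

```python
import numpy as np

def top_terms_per_cluster(centers, feature_names, n_terms=10):
    """Return, for each centroid, its n_terms highest-weighted terms (largest first)."""
    feature_names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row before truncating
    order = np.argsort(centers, axis=1)[:, ::-1][:, :n_terms]
    return [feature_names[row].tolist() for row in order]

# Toy centroids: 2 clusters over a 3-term vocabulary
centers = np.array([[0.1, 0.7, 0.2],
                    [0.5, 0.0, 0.4]])
terms = ["guerre", "histoire", "police"]

print(top_terms_per_cluster(centers, terms, n_terms=2))
# → [['histoire', 'police'], ['guerre', 'police']]
```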

#### 5. **Results and Interpretation**:
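   - The top terms give only a partial picture of the themes in the data. Clusters 0 and 2 surface topical words ("guerre", "police", "cinma"), but Clusters 1, 3, and 4 consist almost entirely of French function words, and Cluster 2 also picks up what appears to be subscriber-only boilerplate ("article", "rserv", "nos", "abonns").
   - With French-aware cleaning, clusters like these can help reveal the underlying structure of the text, identify common themes, and group similar documents together.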
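
All five rows displayed earlier fall in cluster 3, so a natural follow-up is to check how evenly documents are spread across clusters. The sketch below does this with `value_counts` on made-up labels (the `labels` series is hypothetical; in the notebook it would be `sampled_data['cluster']`).

```python
import pandas as pd

# Hypothetical cluster labels for illustration only
labels = pd.Series([3, 3, 3, 1, 4, 3, 0, 3], name="cluster")

# Number of documents per cluster, ordered by cluster id
sizes = labels.value_counts().sort_index()

# Fraction of all documents that each cluster holds
share = sizes / sizes.sum()

print(sizes)
print(share.round(2))
```

A single cluster holding most of the documents would suggest that the features separate the texts poorly.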

This workflow cleaned the text, converted it to TF-IDF features, and produced a first set of topics and clusters to interpret.

Link to data: https://www.kaggle.com/datasets/manueldesiretaira/dataset-for-text-summarization/discussion?sort=hotness