## Vanessa Williams
## Milestone 4

## To complete milestone 5, I needed to change milestone 4

import pandas as pd
import re
import nltk
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
from nltk.corpus import stopwords

# Load NLTK stopwords and specify custom path
nltk.data.path.append("/Users/vanessawilliams/nltk_data")
stop_words = set(stopwords.words('english'))

# Load the dataset
file_path = '/Users/vanessawilliams/Desktop/Vanessa_Williams/ner.csv'
data = pd.read_csv(file_path)

# Clean the text as done in Milestone 3
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

data['cleaned_text'] = data['Sentence'].apply(lambda x: clean_text(str(x)))

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = tfidf.fit_transform(data['cleaned_text'])

# Convert TF-IDF matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())

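# Step 1: Clustering with KMeans
# n_init is set explicitly to keep the current behavior and avoid scikit-learn's FutureWarning about its changing default
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)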
data['cluster'] = kmeans.fit_predict(tfidf_matrix)

# Display top terms per cluster
print("Top terms per cluster:")
for i in range(kmeans.n_clusters):
    cluster_terms = tfidf_df[data['cluster'] == i].sum().sort_values(ascending=False).head(10)
    print(f"\nCluster {i} terms:\n{cluster_terms}")

# Step 2: Named Entity Recognition (NER) with spaCy
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

data['entities'] = data['Sentence'].apply(extract_entities)

# Display sample NER results
print("\nSample Named Entities extracted:")
print(data[['Sentence', 'entities']].head(10))

# Step 3: Text Similarity with Euclidean Distance
sample_distance = euclidean_distances(tfidf_matrix[0], tfidf_matrix[1])
print(f"\nEuclidean Distance between first two sentences: {sample_distance[0][0]}")


Top terms per cluster:

Cluster 0 terms:
says          866.627043
government    577.079505
mr            533.264276
new           448.407089
united        417.288235
year          386.344681
country       371.693201
states        346.038956
world         338.975700
minister      335.793206
dtype: float64

Cluster 1 terms:
people      608.489151
killed      598.802400
wounded     175.191744
bomb        115.347962
says         99.357957
police       92.466762
baghdad      92.232516
attack       78.390278
soldiers     72.636617
attacks      66.686702
dtype: float64

Cluster 2 terms:
said          1162.600846
mr             150.891640
officials      116.717164
spokesman       98.297384
government      80.903639
statement       80.293106
tuesday         77.193566
military        73.721417
friday          72.734315
thursday        71.138870
dtype: float64

Cluster 3 terms:
say            922.872401
officials      427.230847
police         168.580283
authorities    148.392320
killed          

## Milestone 4: Building and Exploring the Text Model

### Objective
In Milestone 4, the goal was to build an initial model to explore the structure and themes in our text data. This involved clustering sentences to identify themes and performing Named Entity Recognition (NER) to extract key entities, in order to develop insight into the topics covered in the dataset and the primary entities mentioned.

### Methodology

1. **Data Preprocessing**:
   - We cleaned each sentence by removing non-alphabetic characters, converting it to lowercase, and removing stopwords using NLTK's English stopword list.
   - After cleaning, we applied TF-IDF vectorization to turn the text into a feature matrix, keeping the 1,000 most frequent terms (`max_features=1000`) and also applying scikit-learn's built-in English stop word list.

2. **KMeans Clustering**:
   - We performed KMeans clustering on the TF-IDF matrix with 5 clusters (`n_clusters=5`, `random_state=0`) to group similar sentences.
   - We then identified common themes by examining the terms with the highest summed TF-IDF weight in each cluster.

3. **Named Entity Recognition (NER)**:
   - Using the `en_core_web_sm` model from spaCy, we conducted Named Entity Recognition on each sentence.
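   - This step allowed us to extract entities such as people, locations, and organizations, giving more insight into the main subjects discussed in the text.

4. **Text Similarity**:
   - As a simple similarity check, we computed the Euclidean distance between the TF-IDF vectors of the first two sentences with scikit-learn's `euclidean_distances`.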
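
The Euclidean distance computed in Step 3 of the notebook code is easier to read with one fact in mind. As a minimal sketch, assuming the vectorizer keeps its default `norm='l2'` setting (as it does in the code above), each non-empty TF-IDF row has unit length, so the squared Euclidean distance equals `2 - 2 * cosine_similarity`. The two toy sentences below are hypothetical and are not taken from `ner.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical sentences standing in for two rows of data['cleaned_text']
toy_sentences = [
    "police say bomb killed people",
    "officials say bomb wounded soldiers",
]

# Default TfidfVectorizer settings L2-normalize every row
toy_matrix = TfidfVectorizer().fit_transform(toy_sentences)

distance = euclidean_distances(toy_matrix[0], toy_matrix[1])[0, 0]
cosine = cosine_similarity(toy_matrix[0], toy_matrix[1])[0, 0]

# For unit-length rows: distance**2 == 2 - 2 * cosine (up to rounding)
print(f"distance={distance:.4f}, cosine={cosine:.4f}")
print(f"distance^2={distance ** 2:.4f}, 2 - 2*cosine={2 - 2 * cosine:.4f}")
```

Under that assumption the distance ranges from 0 (identical term weights) to about 1.41 (no shared terms), which gives a scale for interpreting the value the notebook prints.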

### Results

#### Clustering Analysis
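Each cluster suggests a topic or theme in the data. Below are the themes suggested by the top terms of each cluster. Note that reporting verbs ("says," "said," "say") are the highest-weighted terms in Clusters 0, 2, and 3, so those clusters partly reflect reporting style as well as subject matter.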

- **Cluster 0**: Focuses on government-related topics, with terms like "government," "minister," "united," and "country," indicating discussions about governance, political leaders, and international relations.
- **Cluster 1**: Relates to conflict or violence, with terms like "people," "killed," "bomb," and "police." This suggests the presence of reports on attacks or incidents involving casualties.
- **Cluster 2**: Highlights statements and official reports, with words like "said," "officials," "spokesman," and "statement," likely representing press releases or official announcements.
- **Cluster 3**: Focuses on law enforcement and security, with terms like "officials," "police," "authorities," and "security," pointing to news about public safety and authority responses.
- **Cluster 4**: Centers on leadership and political figures, with frequent mentions of "president," "bush," "chavez," and "minister." This cluster seems to involve discussions around political leaders and events related to them.

#### Named Entity Recognition (NER)
By performing NER, we extracted entities such as names of people, locations, and organizations. This provides additional context for each cluster. For example:
- **Cluster 0** often includes countries or government entities, while **Cluster 4** prominently features political figures, with the title "president" appearing alongside specific names such as "bush" and "chavez."
- This information will help us analyze trends related to particular entities and topics, adding depth to our understanding of the dataset.
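
The notebook stores entities per sentence but does not tabulate them per cluster. One way to back up statements like the ones above, sketched under the assumption that each row carries a `cluster` id and a list of `(text, label)` tuples as produced by `extract_entities`, is to count entity labels within each cluster. The rows below are made-up examples, and `entity_label_counts` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Made-up rows mimicking data[['cluster', 'entities']]; not real results
rows = [
    {"cluster": 0, "entities": [("United States", "GPE"), ("Monday", "DATE")]},
    {"cluster": 0, "entities": [("Iran", "GPE")]},
    {"cluster": 4, "entities": [("Bush", "PERSON"), ("Venezuela", "GPE")]},
    {"cluster": 4, "entities": [("Hugo Chavez", "PERSON")]},
]


def entity_label_counts(rows):
    """Count how often each entity label occurs within each cluster."""
    counts = defaultdict(Counter)
    for row in rows:
        for _text, label in row["entities"]:
            counts[row["cluster"]][label] += 1
    return counts


label_counts = entity_label_counts(rows)
for cluster_id in sorted(label_counts):
    print(cluster_id, label_counts[cluster_id].most_common())
```

With the notebook's DataFrame, the input could be built with `data[['cluster', 'entities']].to_dict('records')`.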

### Observations and Interpretation

- **Cluster Analysis**: The top terms in each cluster map onto recognizable themes. This suggests that the data contains distinguishable topics, allowing us to interpret the dataset based on these key themes.
- **Entity Extraction**: The entities extracted through NER offer insight into the specific people, places, and organizations that appear frequently. This allows us to understand which entities are central to each theme, enhancing the analysis of cluster content.
- **Text Structure**: The combination of clustering and NER demonstrates that the text data is structured around key topics with distinct themes and notable entities, setting a solid foundation for further analysis.
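
One hedged way to put a number on how well separated the clusters are is the silhouette score, which ranges from -1 to 1 and is higher when sentences sit closer to their own cluster than to others. The sketch below compares a few candidate values of `k` on a hypothetical toy corpus; `silhouette_by_k` is an illustrative helper, and nothing here was run on `ner.csv`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for data['cleaned_text']
toy_sentences = [
    "government minister announces new policy",
    "prime minister meets government officials",
    "bomb attack killed people baghdad",
    "police say attack wounded soldiers",
    "president bush visits chavez",
    "president speaks about election results",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_sentences)


def silhouette_by_k(matrix, candidate_ks, random_state=0):
    """Fit KMeans for each k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    return scores


scores = silhouette_by_k(toy_matrix, candidate_ks=range(2, 5))
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On the full dataset the pairwise computation can be expensive, so passing `sample_size` to `silhouette_score` may be necessary.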

### Conclusion
In Milestone 4, we created an initial model that clusters the text data into themes and performs Named Entity Recognition. These steps provide an initial view of the text’s structure, key topics, and primary entities. This exploratory analysis serves as a foundation for building more complex models and performing detailed topic or entity-focused analysis in the future.