# Task 2: Extract prevalent topics from Twitter messages

## 🎯 Objective

- The goal of this project is to **extract the most frequent topics** discussed in tweets related to a specific city or region using **Natural Language Processing (NLP)** techniques.
- This analysis helps identify **trends, public concerns, and popular discussions** by processing tweet content.
- The insights can support **local researchers, journalists, or policymakers** by showing what people are actively talking about in a particular location.

---

## 🧭 Workflow Overview

- 🔐 **Set up API access**: I created a Twitter Developer account and connected to the **Twitter API v2** using the `tweepy` library.

- 📥 **Collect tweets**: Tweets were collected using **location-related keywords** such as city names and hashtags (e.g., *Tashkent*, *Uzbekistan*, `#Tashkent`).

- 🧼 **Clean and preprocess data**  
  - Remove URLs, mentions, emojis, and duplicates  
  - Apply NLP preprocessing: **tokenization**, **stopword removal**, and **lemmatization**

- 🔍 **Entity analysis**  
  - Identify the most frequently used **hashtags**  
  - Highlight **top contributors** (while anonymizing personal data)

- 🧠 **Topic modeling techniques**  
  I experimented with the following methods to extract key topics:
  - **LDA (Latent Dirichlet Allocation)**
  - **NMF (Non-negative Matrix Factorization)**
  - **BERTopic** (for context-rich topic extraction using transformers)

- 📊 **Visualize & interpret results**  
  - Generate charts and visual summaries of the **top 5 most prevalent topics**
  - Analyze patterns and draw meaningful conclusions
---
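
The entity-analysis step in the workflow above (frequent hashtags, top contributors with anonymized IDs) can be sketched with the standard library alone. This is an illustrative sketch, not the notebook's implementation: the function names `top_hashtags`, `anonymize_author`, and `top_contributors`, the salt value, and the sample data are all assumptions made for the example.

```python
import hashlib
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(texts, n=5):
    """Count hashtags case-insensitively and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts.most_common(n)


def anonymize_author(author_id, salt="replace-with-a-secret-salt"):
    """Map an author ID to a short, stable pseudonym via a salted SHA-256 hash.

    The salt here is a placeholder (assumption); a real one should be kept private.
    """
    digest = hashlib.sha256(f"{salt}:{author_id}".encode("utf-8")).hexdigest()
    return f"user_{digest[:8]}"


def top_contributors(author_ids, n=5):
    """Count tweets per pseudonymized author and return the n most active."""
    return Counter(anonymize_author(a) for a in author_ids).most_common(n)


# Made-up sample data, for illustration only
sample_texts = [
    "Spring in #Tashkent is beautiful",
    "Visiting #tashkent and #Samarkand this week",
    "No hashtags here",
]
sample_authors = [101, 202, 101]

print(top_hashtags(sample_texts))
print(top_contributors(sample_authors))
```

Hashing with a salt keeps the pseudonyms stable across runs, so contributor counts stay meaningful while the raw `author_id` values never appear in charts or tables.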

## Diagram for ScholarAI

![ScholarAI Diagram](https://i.postimg.cc/jj4GrYzg/Diagram-1.png)

# Notebook Installations

In [1]:
!pip install -qU tweepy

In [2]:
import re
import time

import pandas as pd
import tweepy
from IPython.display import HTML
from kaggle_secrets import UserSecretsClient
from tweepy.errors import TooManyRequests
from wordcloud import WordCloud

# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [3]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

! unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Archive:  /usr/share/nltk_data/corpora/wordnet.zip
   creating: /usr/share/nltk_data/corpora/wordnet/
  inflating: /usr/share/nltk_data/corpora/wordnet/lexnames  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adv  
  inflating: /usr/share/nltk_data/corpora/wordnet/adv.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.verb  
  inflating: /usr/share/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /usr/share/nltk_data/corpora/wordnet/data.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.adj  
  inflating: /usr/share/nltk_data/corpora/wordnet/LICENSE  
  inflating: /usr/share/nltk_data/corpora/wordnet/citation.bib  
  inflating: /usr/share/nltk_data/corpora/wordnet/noun.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/verb.exc  
  inflating: /usr/share/nltk_data/corpora/wordnet/README  
  inflating: /usr/share/nltk_data/corpora/wordnet/index.sense  
  inflating: /usr

In [4]:
user_secrets = UserSecretsClient()
# access_token = user_secrets.get_secret("access_token")
# access_token_secret = user_secrets.get_secret("access_token_secret")
bearer_token = user_secrets.get_secret("BEARER_TOKEN")
# consumer_key = user_secrets.get_secret("consumer_key")
# consumer_secret = user_secrets.get_secret("consumer_secret")

## Step 1: Data Collection

I used the **Twitter API v2** to collect recent tweets related to Uzbekistan by searching for keywords such as **"Tashkent"**, **"Uzbekistan"**, and **"Samarkand"**, plus the hashtags **#Tashkent** and **#Uzbekistan**. Since I don't have access to location-based filters, I relied on these keywords together with the language filter `lang:en` and the `-is:retweet` operator to get relevant, original tweets.

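I fetched tweets with **Tweepy**, pausing between requests to respect the rate limit. The cell below issues a single request for up to 100 tweets and falls back to a previously saved CSV when the limit is reached. Each tweet record includes the ID, text, author ID, language, creation time, and context annotations (if available).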
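
The idea of fetching in batches with a pause between requests can be sketched as follows. This is a minimal, offline illustration under stated assumptions: `collect_in_batches`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for a real API call so the example runs without credentials.

```python
import time


def collect_in_batches(fetch_page, max_pages=3, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch_page(next_token) repeatedly, pausing between requests.

    fetch_page is a hypothetical callable returning (items, next_token),
    where next_token is None once there are no more pages.
    """
    collected = []
    token = None
    for page in range(max_pages):
        items, token = fetch_page(token)
        collected.extend(items)
        if token is None:
            break  # no further pages available
        if page < max_pages - 1:
            sleep(delay_seconds)  # wait before the next request
    return collected


def fake_fetch_page(token):
    """Offline stand-in for an API call: two pages of made-up items."""
    pages = {None: (["t1", "t2"], "p2"), "p2": (["t3"], None)}
    return pages[token]


tweets = collect_in_batches(fake_fetch_page, delay_seconds=0)
print(tweets)
```

With Tweepy, `fetch_page` could wrap `client.search_recent_tweets(..., next_token=token)` and read the next token from the response's `meta` dictionary; that wiring is an assumption and is not shown in this notebook.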

In [5]:
client = tweepy.Client(bearer_token=bearer_token)

In [6]:
# Pre-define df in case all branches fail
df = None

query = (
    "(Tashkent OR Uzbekistan OR #Uzbekistan OR #Tashkent OR Samarqand OR Samarkand) "
    "-is:retweet lang:en"
)

try: 
    tweet = client.search_recent_tweets(
        query=query,
        tweet_fields=['context_annotations', 'created_at', 'author_id', 'lang'],
        max_results=100
    )

    if tweet.data is not None:
        tweet_list = []
        for t in tweet.data:
            tweet_list.append({
                "id": t.id,
                "text": t.text,
                "created_at": t.created_at,
                "author_id": t.author_id,
                "lang": t.lang,
                "context_annotations": t.context_annotations if hasattr(t, "context_annotations") else None
            })

        df = pd.DataFrame(tweet_list)
        print(df.head())
    else:
        print("No tweets returned.")

except TooManyRequests:
    print("Twitter API limit reached. Loading fallback CSV...")
    df = pd.read_csv("/kaggle/input/tweets-from-uzbekistan/submission (1).csv")
    if 'Unnamed: 0' in df.columns:
        df.drop('Unnamed: 0', axis=1, inplace=True)
    print(df.head())

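finally:
    if df is not None:
        df.to_csv('submission.csv', index=False)

        def create_download_link(title="Download CSV file", filename="submission.csv"):
            """Return an HTML anchor pointing at the saved CSV."""
            html = f'<a href="{filename}">{title}</a>'
            return HTML(html)

        # A bare expression inside a block is not rendered by the notebook,
        # so pass the link to display() explicitly.
        from IPython.display import display
        display(create_download_link(filename='submission.csv'))
    else:
        print("No DataFrame available to save.")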

Twitter API limit reached. Loading fallback CSV...
                    id                                               text  \
0  1913128503744373141  @mlecchabaap11 @MR_Blue_EyeOF @goodbroto @Shit...   
1  1913125520717345124                               @lukebelmar Tashkent   
2  1913125509816348753  @brfootball @aguneribe is this also the most s...   
3  1913124753470263713  @food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! BAZAAR! ...   
4  1913124751788458286  @food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! Unusual ...   

                  created_at            author_id lang  \
0  2025-04-18 07:12:34+00:00  1635318248433664001   en   
1  2025-04-18 07:00:42+00:00            214155473   en   
2  2025-04-18 07:00:40+00:00  1056126774768218114   en   
3  2025-04-18 06:57:40+00:00  1863847655589462016   en   
4  2025-04-18 06:57:39+00:00  1862298597116846082   en   

                                 context_annotations  
0                                                NaN  
1                          

In [7]:
df.shape

(100, 6)

In [8]:
df.head()

Unnamed: 0,id,text,created_at,author_id,lang,context_annotations
0,1913128503744373141,@mlecchabaap11 @MR_Blue_EyeOF @goodbroto @Shit...,2025-04-18 07:12:34+00:00,1635318248433664001,en,
1,1913125520717345124,@lukebelmar Tashkent,2025-04-18 07:00:42+00:00,214155473,en,
2,1913125509816348753,@brfootball @aguneribe is this also the most s...,2025-04-18 07:00:40+00:00,1056126774768218114,en,"[{'domain': {'id': '11', 'name': 'Sport', 'des..."
3,1913124753470263713,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! BAZAAR! ...,2025-04-18 06:57:40+00:00,1863847655589462016,en,"[{'domain': {'id': '46', 'name': 'Business Tax..."
4,1913124751788458286,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! Unusual ...,2025-04-18 06:57:39+00:00,1862298597116846082,en,"[{'domain': {'id': '46', 'name': 'Business Tax..."


In [9]:
print(f"Missing values per column:\n{df.isnull().sum()}")

Missing values per column:
id                      0
text                    0
created_at              0
author_id               0
lang                    0
context_annotations    48
dtype: int64


In [10]:
print(f"The types of the data: \n{df.dtypes}\n")

The types of the data: 
id                      int64
text                   object
created_at             object
author_id               int64
lang                   object
context_annotations    object
dtype: object
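
The dtype listing above shows `created_at` and `context_annotations` as generic `object` values, which are plain strings when the data comes from the fallback CSV. A possible way to turn them into usable Python objects is sketched below. It assumes the annotations were saved as a Python-literal string (the single-quoted preview suggests this); the helper names and the sample annotation are made up for illustration.

```python
import ast
from datetime import datetime


def parse_created_at(value):
    """Parse a timestamp string such as '2025-04-18 07:12:34+00:00'."""
    return datetime.fromisoformat(value)


def parse_annotations(value):
    """Return the annotations as a list of dicts, or [] when missing (e.g. NaN)."""
    if not isinstance(value, str) or not value.strip():
        return []
    return ast.literal_eval(value)


def annotation_domains(value):
    """Collect the distinct domain names attached to one tweet."""
    return sorted({a["domain"]["name"] for a in parse_annotations(value)})


ts = parse_created_at("2025-04-18 07:12:34+00:00")

# Made-up annotation string in the same shape as the CSV preview
raw = "[{'domain': {'id': '11', 'name': 'Sport'}, 'entity': {'id': '1', 'name': 'Football'}}]"

print(ts.year, ts.tzinfo)
print(annotation_domains(raw))
print(annotation_domains(float("nan")))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.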



## Step 2: Data Preprocessing


* **Text Cleaning:** Remove any HTML tags, special characters, numbers, and other non-alphabetic characters.
* **Tokenization:** Split the tweets into individual words (tokens).
* **Stop Words Removal:** Remove common words that carry little topical meaning, such as 'and', 'the', 'is', etc.
* **Lemmatization:** Reduce words to their base or root form.
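
Tweets also carry noise that generic text cleaning does not target directly: URLs, @mentions, and emojis. A minimal regex-based sketch for stripping these before tokenization is shown below. The function name `strip_tweet_noise` and the patterns are assumptions for illustration, and the ASCII-only letter filter is a simplification that suits English-language tweets.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep ASCII letters and whitespace only; this drops emojis, digits,
# punctuation, and the '#' sign (the hashtag word itself is kept).
NON_LETTER_RE = re.compile(r"[^a-zA-Z\s]")


def strip_tweet_noise(text):
    """Remove URLs, @mentions, emojis, and other non-letter characters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


# Made-up example in the style of the collected tweets
sample = "@food_porn17 #Food 🎉 UZBEKISTAN! BAZAAR! https://t.co/abc123"
print(strip_tweet_noise(sample))
```

Removing mentions before tokenization matters because a tokenizer may split `@user` into `@` and `user`, after which an alphabetic-only filter keeps the username as if it were an ordinary word.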


In [11]:
df.drop_duplicates('text', inplace=True)
print(df.shape)
df.head()

(100, 6)


Unnamed: 0,id,text,created_at,author_id,lang,context_annotations
0,1913128503744373141,@mlecchabaap11 @MR_Blue_EyeOF @goodbroto @Shit...,2025-04-18 07:12:34+00:00,1635318248433664001,en,
1,1913125520717345124,@lukebelmar Tashkent,2025-04-18 07:00:42+00:00,214155473,en,
2,1913125509816348753,@brfootball @aguneribe is this also the most s...,2025-04-18 07:00:40+00:00,1056126774768218114,en,"[{'domain': {'id': '11', 'name': 'Sport', 'des..."
3,1913124753470263713,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! BAZAAR! ...,2025-04-18 06:57:40+00:00,1863847655589462016,en,"[{'domain': {'id': '46', 'name': 'Business Tax..."
4,1913124751788458286,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! Unusual ...,2025-04-18 06:57:39+00:00,1862298597116846082,en,"[{'domain': {'id': '46', 'name': 'Business Tax..."


In [12]:
# remove html tags from text
def remove_html_tags(text):
    """
    Remove HTML tags from a given string.

    Parameters:
        text (str): The input string containing HTML tags.

    Returns:
        str: The cleaned string with HTML tags removed.
    """
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

In [13]:
def clean_review(text, wl=WordNetLemmatizer(), 
                 stop_words=set(stopwords.words('english'))):
    """
    Clean and preprocess a text string for NLP analysis.

    This function performs several preprocessing steps:
    - Removes HTML tags
    - Converts text to lowercase
    - Tokenizes the text
    - Removes stopwords and non-alphabetic tokens
    - Applies lemmatization

    Parameters:
        text (str): The input text string to be cleaned.
        wl (WordNetLemmatizer, optional): Lemmatizer instance. Defaults to WordNetLemmatizer().
        stop_words (set, optional): Set of stopwords to remove. Defaults to English stopwords from NLTK.

    Returns:
        str: The cleaned and lemmatized text string.
    """
    cleaned_text = remove_html_tags(text)
    words = word_tokenize(cleaned_text.lower())
    
    filtered_words = []
    for word in words:
        if word not in stop_words and word.isalpha():
            filtered_words.append(wl.lemmatize(word))
    
    filtered_words = ' '.join(filtered_words)    
    return filtered_words

In [14]:
print(df.columns)

Index(['id', 'text', 'created_at', 'author_id', 'lang', 'context_annotations'], dtype='object')


In [15]:
df.head()

Unnamed: 0,id,text,created_at,author_id,lang,context_annotations
0,1913128503744373141,@mlecchabaap11 @MR_Blue_EyeOF @goodbroto @Shit...,2025-04-18 07:12:34+00:00,1635318248433664001,en,
1,1913125520717345124,@lukebelmar Tashkent,2025-04-18 07:00:42+00:00,214155473,en,
2,1913125509816348753,@brfootball @aguneribe is this also the most s...,2025-04-18 07:00:40+00:00,1056126774768218114,en,"[{'domain': {'id': '11', 'name': 'Sport', 'des..."
3,1913124753470263713,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! BAZAAR! ...,2025-04-18 06:57:40+00:00,1863847655589462016,en,"[{'domain': {'id': '46', 'name': 'Business Tax..."
4,1913124751788458286,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! Unusual ...,2025-04-18 06:57:39+00:00,1862298597116846082,en,"[{'domain': {'id': '46', 'name': 'Business Tax..."


In [16]:
%%time
import re

# Apply the `clean_review` function to each tweet
df['clean text'] = df['text'].apply(clean_review)
df.head()

CPU times: user 3.22 s, sys: 352 ms, total: 3.57 s
Wall time: 3.58 s


Unnamed: 0,id,text,created_at,author_id,lang,context_annotations,clean text
0,1913128503744373141,@mlecchabaap11 @MR_Blue_EyeOF @goodbroto @Shit...,2025-04-18 07:12:34+00:00,1635318248433664001,en,,goodbroto shitpostgate central asia typically ...
1,1913125520717345124,@lukebelmar Tashkent,2025-04-18 07:00:42+00:00,214155473,en,,lukebelmar tashkent
2,1913125509816348753,@brfootball @aguneribe is this also the most s...,2025-04-18 07:00:40+00:00,1056126774768218114,en,"[{'domain': {'id': '11', 'name': 'Sport', 'des...",brfootball aguneribe also successful club uzbe...
3,1913124753470263713,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! BAZAAR! ...,2025-04-18 06:57:40+00:00,1863847655589462016,en,"[{'domain': {'id': '46', 'name': 'Business Tax...",food uzbekistan bazaar shock recipe chicken br...
4,1913124751788458286,@food_porn17 #Food 🎉👨‍🍳🎖️ UZBEKISTAN! Unusual ...,2025-04-18 06:57:39+00:00,1862298597116846082,en,"[{'domain': {'id': '46', 'name': 'Business Tax...",food uzbekistan unusual recipe best meat manga...


In [17]:
df.drop('text', axis=1, inplace=True)
df.head()

Unnamed: 0,id,created_at,author_id,lang,context_annotations,clean text
0,1913128503744373141,2025-04-18 07:12:34+00:00,1635318248433664001,en,,goodbroto shitpostgate central asia typically ...
1,1913125520717345124,2025-04-18 07:00:42+00:00,214155473,en,,lukebelmar tashkent
2,1913125509816348753,2025-04-18 07:00:40+00:00,1056126774768218114,en,"[{'domain': {'id': '11', 'name': 'Sport', 'des...",brfootball aguneribe also successful club uzbe...
3,1913124753470263713,2025-04-18 06:57:40+00:00,1863847655589462016,en,"[{'domain': {'id': '46', 'name': 'Business Tax...",food uzbekistan bazaar shock recipe chicken br...
4,1913124751788458286,2025-04-18 06:57:39+00:00,1862298597116846082,en,"[{'domain': {'id': '46', 'name': 'Business Tax...",food uzbekistan unusual recipe best meat manga...


In [18]:
df.drop_duplicates('clean text', inplace=True) 
df.reset_index(drop=True, inplace=True)

In [19]:
print(df.shape)
df.head()

(92, 6)


Unnamed: 0,id,created_at,author_id,lang,context_annotations,clean text
0,1913128503744373141,2025-04-18 07:12:34+00:00,1635318248433664001,en,,goodbroto shitpostgate central asia typically ...
1,1913125520717345124,2025-04-18 07:00:42+00:00,214155473,en,,lukebelmar tashkent
2,1913125509816348753,2025-04-18 07:00:40+00:00,1056126774768218114,en,"[{'domain': {'id': '11', 'name': 'Sport', 'des...",brfootball aguneribe also successful club uzbe...
3,1913124753470263713,2025-04-18 06:57:40+00:00,1863847655589462016,en,"[{'domain': {'id': '46', 'name': 'Business Tax...",food uzbekistan bazaar shock recipe chicken br...
4,1913124751788458286,2025-04-18 06:57:39+00:00,1862298597116846082,en,"[{'domain': {'id': '46', 'name': 'Business Tax...",food uzbekistan unusual recipe best meat manga...
