# Stage 1: Enhanced Data Cleaning, Preprocessing, and Exploratory Analysis 
In this notebook, we perform **data cleaning, preprocessing, and exploratory data analysis (EDA)** on the Cleantech Media and Google Patent datasets. The goal is to identify **trends, key technologies, and innovation gaps** by analyzing media publications and patents.

In [1]:
import pandas as pd

## Data Collection and Cleaning
Before analyzing the data, we first **load, inspect, and clean** the datasets:  

- **Load datasets**: We import the **Cleantech Media Dataset** and the **Cleantech Google Patent Dataset** into Pandas DataFrames.  
- **Remove duplicates**: Identical or near-identical entries are removed so that repeated articles do not skew counts and frequencies.  
- **Handle missing values**: We check for null or incomplete entries and decide whether to impute, replace, or remove them.  
- **Filter relevant information**: Non-informative texts (e.g., generic statements) are removed to ensure high-quality analysis.  
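
The cleaning steps above can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's actual pipeline: the rows, the `MIN_CONTENT_CHARS` threshold, and the choice to drop (rather than impute) missing text are all assumptions. Note that `drop_duplicates` only catches exact duplicates; near-identical entries would need a separate similarity check.

```python
import pandas as pd

# Toy stand-in for the media dataset; the column names mirror the real file,
# but the rows are made up for illustration.
toy = pd.DataFrame(
    {
        "title": ["Solar boom", "Solar boom", "Wind update", "Cookie notice"],
        "content": [
            "Utility-scale solar capacity grew sharply last year.",
            "Utility-scale solar capacity grew sharply last year.",
            None,
            "Sign in.",
        ],
    }
)

MIN_CONTENT_CHARS = 20  # hypothetical cutoff for "non-informative" text

# Missing values per column, inspected before deciding how to handle them.
print(toy.isna().sum())

cleaned = (
    toy.drop_duplicates(subset=["title", "content"])  # exact duplicates only
    .dropna(subset=["content"])  # rows without text cannot be analyzed
    .loc[lambda d: d["content"].str.len() >= MIN_CONTENT_CHARS]  # drop very short texts
    .reset_index(drop=True)
)
print(cleaned)
```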

In [18]:
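media_dataset_path = "../data/cleantech_media_dataset_v3_2024-10-28.csv"
# NOTE: this file name refers to RAG evaluation data, and the info() output below
# lists question columns; confirm that this is the intended patent file.
google_patent_dataset_path = "../data/cleantech_rag_evaluation_data_2024-09-20.csv"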

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 20)

# Load CSV files (the second file is semicolon-separated)
df_media = pd.read_csv(media_dataset_path, header=0)
df_google_patents = pd.read_csv(google_patent_dataset_path, sep=";", header=0)

# info() prints its summary and returns None, which print() then echoes
print("Raw cleantech_media_dataset_v3_2024-10-28.csv:")
print(df_media.info())

print("Raw cleantech_rag_evaluation_data_2024-09-20.csv:")
print(df_google_patents.info())
print(df_google_patents.head())

Raw cleantech_media_dataset_v3_2024-10-28.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20111 entries, 0 to 20110
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  20111 non-null  int64  
 1   title       20111 non-null  object 
 2   date        20111 non-null  object 
 3   author      0 non-null      float64
 4   content     20111 non-null  object 
 5   domain      20111 non-null  object 
 6   url         20111 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 1.1+ MB
None
Raw cleantech_rag_evaluation_data_2024-09-20.csv:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 6 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   example_id                  23 non-null     object
 1   question_id                 23 non-null     object
 2   question                    

In [16]:
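# Create a new DataFrame for the processed data:
# rename the unnamed first column to 'id' and drop 'author',
# which has no non-null values in the info() output above
df_media_processed = df_media.rename(columns={df_media.columns[0]: 'id'})
df_media_processed = df_media_processed.drop(columns=['author'])
df_media_processed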

| | id | title | date | content | domain | url |
|---|---|---|---|---|---|---|
| 0 | 93320 | XPeng Delivered ... | 2022-01-02 | ['Chinese automo... | cleantechnica | https://cleantec... |
| 1 | 93321 | Green Hydrogen: ... | 2022-01-02 | ['Sinopec has la... | cleantechnica | https://cleantec... |
| 2 | 98159 | World’ s largest... | 2022-01-03 | ['Huaneng Power ... | pv-magazine | https://www.pv-m... |
| 3 | 98158 | Iran wants to de... | 2022-01-03 | ['According to t... | pv-magazine | https://www.pv-m... |
| 4 | 31128 | Eastern Intercon... | 2022-01-03 | ['Sign in to get... | naturalgasintel | https://www.natu... |
| ... | ... | ... | ... | ... | ... | ... |
| 20106 | 104263 | US Treasury fina... | 2024-10-24 | ['The US Departm... | pv-tech | https://www.pv-t... |
| 20107 | 104264 | EDP trials robot... | 2024-10-24 | ['Developer EDP ... | pv-tech | https://www.pv-t... |
| 20108 | 101434 | Australia has 7.... | 2024-10-24 | ['The volume of ... | pv-magazine | https://www.pv-m... |
| 20109 | 101428 | Residential PV p... | 2024-10-24 | ['The comparison... | pv-magazine | https://www.pv-m... |
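
The `content` values shown above start with `['`, which suggests that each cell holds a stringified Python list of paragraphs. Assuming that is the case, a hedged sketch for turning such a cell into plain text could look like this. The helper name `join_paragraphs` and the example string are hypothetical.

```python
import ast


def join_paragraphs(raw: str) -> str:
    """Turn a stringified list of paragraphs into one text string.

    Falls back to the raw string when it is not a valid list literal.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(parsed, list):
        return " ".join(str(part).strip() for part in parsed)
    return raw


# Made-up example in the same shape as the truncated cells above.
example = "['First paragraph.', 'Second paragraph.']"
print(join_paragraphs(example))
```

If the assumption holds, the helper could be applied column-wise, for example with `df_media_processed["content"].map(join_paragraphs)`.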


## Text Preprocessing
To prepare the text for **natural language processing (NLP) tasks**, we apply common preprocessing techniques:  

- **Tokenization**: Split text into individual words or subwords.  
- **Stopword Removal**: Remove common but uninformative words (e.g., "the", "is", "and").  
- **Stemming & Lemmatization**: Reduce words to a common base form (e.g., "developing" → "develop"). Stemming strips suffixes heuristically, while lemmatization maps each word to its dictionary form.  
- **Lowercasing**: Convert all text to lowercase so that variants such as "Solar" and "solar" are counted as the same token.  

These steps improve the quality of text-based analysis and ensure consistency across datasets.
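
As a rough illustration of these steps, the sketch below uses only the standard library. The stopword list is deliberately tiny, and `crude_stem` is a naive suffix stripper standing in for a real stemmer or lemmatizer (for example from NLTK or spaCy); both are assumptions made to keep the example self-contained.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "is", "and", "a", "an", "of", "in", "to", "are", "for"}

# Suffixes removed by the naive stemmer below (an assumption, not a standard).
SUFFIXES = ("ing", "ed", "s")


def crude_stem(token: str) -> str:
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase, then tokenize
    kept = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [crude_stem(t) for t in kept]  # naive stemming


print(preprocess("The company is developing new solar panels and testing inverters."))
# → ['company', 'develop', 'new', 'solar', 'panel', 'test', 'inverter']
```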

## Exploratory Data Analysis
EDA helps us **understand data patterns and distributions** before applying complex NLP models. We perform:  

- **Temporal Analysis**: We examine **publication trends** over time to detect emerging Cleantech topics.  
- **Named Entity Recognition (NER)**: Identify key **companies, organizations, and technologies** frequently mentioned in the datasets.  
- **Word Frequency Analysis**: Find the most common words and phrases across media and patents.  
- **Visualization**:  
  - **Word Clouds** to showcase frequently occurring terms  
  - **Bar Charts** to compare key industry players and technology mentions  
  - **Network Graphs** to analyze relationships between companies and technologies  
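
Two of the analyses above, temporal analysis and word frequency analysis, can be sketched on toy data as follows. The rows are invented for illustration, and whitespace splitting stands in for proper tokenization.

```python
from collections import Counter

import pandas as pd

# Toy rows standing in for the media dataset (illustrative values only).
toy = pd.DataFrame(
    {
        "date": ["2022-01-02", "2022-01-03", "2022-02-10", "2022-02-11", "2022-02-28"],
        "content": [
            "green hydrogen pilot",
            "solar module prices",
            "hydrogen electrolyzer order",
            "solar farm approved",
            "hydrogen hub funding",
        ],
    }
)

# Temporal analysis: number of articles per calendar month.
months = pd.to_datetime(toy["date"]).dt.to_period("M")
articles_per_month = months.value_counts().sort_index()
print(articles_per_month)

# Word frequency: count whitespace-separated tokens across all articles.
word_counts = Counter(word for text in toy["content"] for word in text.split())
print(word_counts.most_common(2))
# → [('hydrogen', 3), ('solar', 2)]
```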

## Topic Modeling
To **identify hidden themes and emerging trends**, we apply topic modeling techniques on both datasets:  

- **Latent Dirichlet Allocation (LDA)** and **Non-Negative Matrix Factorization (NMF)** to uncover broad thematic structures.  
- **Top2Vec** and **BERTopic**, which cluster document embeddings and can therefore capture context that bag-of-words models such as LDA and NMF miss.  
- **Comparing Media vs. Patents**:  
  - Which Cleantech topics are **gaining media attention** but are **not yet patented**?  
  - Are **patents aligned with market trends**, or do they focus on different areas?  
  - **What are the innovation gaps** between research and real-world applications?  

By the end of this step, we will have a **structured view of the Cleantech landscape**, highlighting **key trends, players, and technological opportunities**.
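
One simple way to approximate the media-versus-patent comparison is to look for terms that recur in media texts but never appear in patent texts. The sketch below does this with document frequencies; the example texts, the helper name `document_frequency`, and the threshold of two media documents are all hypothetical.

```python
from collections import Counter


def document_frequency(docs: list[str]) -> Counter:
    """Count in how many documents each lowercase token appears."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


# Invented example texts, not taken from the datasets.
media_docs = [
    "perovskite tandem cells hit record",
    "startup scales perovskite cells",
    "hydrogen hub funding",
]
patent_docs = [
    "method for producing hydrogen",
    "tandem cells with improved coating",
]

media_doc_freq = document_frequency(media_docs)
patent_doc_freq = document_frequency(patent_docs)

# Terms found in at least two media texts but in no patent text.
media_only = sorted(
    term
    for term, n in media_doc_freq.items()
    if n >= 2 and patent_doc_freq[term] == 0
)
print(media_only)
# → ['perovskite']
```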
