# NLP Final Project - Data Pre-processing & Topic Modeling

This final project uses a collection of ~200K news articles on data science, machine learning, and artificial intelligence. The task is to identify which industries and job lines are likely to be most affected by AI over the next several years, based on the insights that can be extracted from this text corpus.

Goal: provide actionable recommendations on how AI can be used to automate jobs, improve employee productivity, and make AI adoption successful. Pay particular attention to novel technologies and algorithms, such as AI for image generation and conversational AI, because they represent a paradigm shift in the adoption of AI technologies and of data science in general.


## Importing Data

In [4]:
import pandas as pd

!pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-16.1.0


In [5]:
df_news_final_project = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')
df_news_final_project.shape

(200141, 5)

In [6]:
df_news_final_project.head()

| | url | date | language | title | text |
|---|---|---|---|---|---|
| 0 | http://auckland.scoop.co.nz/2020/01/aut-boosts... | 2020-01-28 | en | auckland.scoop.co.nz » AUT boosts AI expertise... | \n\nauckland.scoop.co.nz » AUT boosts AI exper... |
| 1 | http://spaceref.com/astronomy/observation-simu... | 2021-07-05 | en | Observation, Simulation, And AI Join Forces To... | \n\nObservation, Simulation, And AI Join Force... |
| 2 | http://www.mysmartrend.com/news-briefs/technic... | 2020-04-17 | en | Cr Bard Inc Has Returned 48.9% Since SmarTrend... | \n\nCr Bard Inc Has Returned 48.9% Since SmarT... |
| 3 | http://www.productivityapps.itbusinessnet.com/... | 2020-06-23 | en | Applitools Visual AI Reaches One Billion Image... | \n\nApplitools Visual AI Reaches One Billion I... |
| 4 | http://www.sbwire.com/press-releases/data-scie... | 2020-12-24 | en | Data Science and Machine-Learning Platforms Ma... | \n\nData Science and Machine-Learning Platform... |


## Data Cleaning

In [7]:
# Take an explicit copy so new columns can be added without a SettingWithCopyWarning
text = df_news_final_project[['text']].copy()

In [8]:
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    return text

text['cleaned_text'] = text['text'].apply(clean_text)


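To make the effect of `clean_text` concrete, here is a small sketch that runs the same three substitutions on a made-up sentence. The sample string and the `clean_text_spaced` variant are illustrative assumptions, not part of the notebook. Because punctuation is deleted rather than replaced, hyphenated words and dotted names are fused into a single token, and a leading space survives the whitespace step.

```python
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)   # Remove URLs
    text = re.sub(r'\s+', ' ', text)      # Collapse whitespace runs to one space
    text = re.sub(r'[^\w\s]', '', text)   # Delete punctuation and symbols
    return text

def clean_text_spaced(text):
    """Hypothetical variant: replace punctuation with a space, then tidy whitespace."""
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample in the style of the article text
sample = "\n\nData Science and Machine-Learning Platforms: see https://example.com/report now."

original = clean_text(sample)
spaced = clean_text_spaced(sample)
print(repr(original))  # ' Data Science and MachineLearning Platforms see now'
print(repr(spaced))    # 'Data Science and Machine Learning Platforms see now'
```

Whether fused tokens such as `MachineLearning` are acceptable depends on the downstream step; the spaced variant is only one possible alternative, and it would change the tokens shown in the later outputs.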


In [9]:
!pip install nltk

## Text Cleaning for Topic Modeling

In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string

nltk.download('punkt')
nltk.download('stopwords')

# Build the stop-word set and the stemmer once instead of once per article
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_for_topic_modeling(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))  # Drop any remaining punctuation
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stop words
    tokens = [stemmer.stem(word) for word in tokens]  # Reduce each word to its stem
    return tokens

text['topic_tokens'] = text['cleaned_text'].apply(preprocess_for_topic_modeling)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
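Topic models such as LDA typically consume bag-of-words counts rather than raw token lists. As a rough sketch of that next step, using only the standard library, the stemmed tokens could be mapped to integer ids and per-document counts. The `docs` values and the helper names `build_vocabulary` and `to_bow` are made up for illustration and do not appear in the notebook.

```python
from collections import Counter

# Hypothetical stand-ins for a few rows of text['topic_tokens']
docs = [
    ["ai", "boost", "expertis", "ai"],
    ["data", "scienc", "platform", "market"],
    ["ai", "data", "market"],
]

def build_vocabulary(token_lists, min_doc_freq=1):
    """Map each token found in at least min_doc_freq documents to an integer id."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    kept = sorted(tok for tok, freq in doc_freq.items() if freq >= min_doc_freq)
    return {tok: idx for idx, tok in enumerate(kept)}

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document, skipping unknown tokens."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

vocab = build_vocabulary(docs, min_doc_freq=2)
bows = [to_bow(tokens, vocab) for tokens in docs]
print(vocab)  # {'ai': 0, 'data': 1, 'market': 2}
print(bows)   # [[(0, 2)], [(1, 1), (2, 1)], [(0, 1), (1, 1), (2, 1)]]
```

In practice a library would normally handle this; gensim's `corpora.Dictionary` with `doc2bow`, for example, produces the same kind of `(token_id, count)` pairs. The `min_doc_freq` threshold here is an arbitrary illustrative value.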


## Text Cleaning for Entity Recognition

In [11]:

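def preprocess_for_entity_recognition(text):
    # Assumed body: tokenize only, keeping case and stop words, which matches
    # the entity_tokens column shown in the output further down.
    tokens = word_tokenize(text)
    return tokens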

text['entity_tokens'] = text['cleaned_text'].apply(preprocess_for_entity_recognition)





## Filter relevant articles

In [12]:
# Define keywords that might indicate relevance to AI, machine learning, etc.
keywords = ['artificial intelligence', 'AI', 'machine learning', 'ML', 'deep learning', 'neural network', 'data science']

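# Filtering function to check for the presence of keywords.
# The text is lowercased before matching, so the uppercase entries 'AI' and 'ML'
# never match; only the lowercase phrases take effect, as plain substrings.
def filter_relevant_articles(text):
    return any(keyword in text.lower() for keyword in keywords)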

# Apply the filter function
df = text[text['cleaned_text'].apply(filter_relevant_articles)]
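The filter above compares a lowercased text against keywords that include the uppercase `'AI'` and `'ML'`, so those two entries cannot match. Simply lowercasing them would instead match substrings such as the "ai" in "said". One possible alternative, sketched here under the assumption that acronyms should match case-sensitively as whole words and phrases should match case-insensitively, uses word-boundary regular expressions. The name `is_relevant` and the sample sentences are illustrative, and applying it would change the article count reported below.

```python
import re

keywords = ['artificial intelligence', 'AI', 'machine learning', 'ML',
            'deep learning', 'neural network', 'data science']

# Original filter, reproduced for comparison
def filter_relevant_articles(text):
    return any(keyword in text.lower() for keyword in keywords)

# Split the list: all-caps entries are treated as acronyms
acronyms = [k for k in keywords if k.isupper()]
phrases = [k for k in keywords if not k.isupper()]

# \b anchors each alternative at word boundaries
acronym_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, acronyms)) + r')\b')
phrase_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, phrases)) + r')\b',
                       re.IGNORECASE)

def is_relevant(text):
    """True if the text contains a whole-word acronym or a keyword phrase."""
    return bool(acronym_re.search(text) or phrase_re.search(text))

print(filter_relevant_articles("AUT boosts AI expertise"))  # False
print(is_relevant("AUT boosts AI expertise"))               # True
print(is_relevant("He said the main aim was clear"))        # False
print(is_relevant("Deep Learning advances"))                # True
print(is_relevant("HTML markup"))                           # False
```

Note that `cleaned_text` has already lost its hyphens, so a fused form such as `MachineLearning` would still be missed by either version.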


In [13]:
print(df.shape)
df.head()

(145456, 4)


| | text | cleaned_text | topic_tokens | entity_tokens |
|---|---|---|---|---|
| 0 | \n\nauckland.scoop.co.nz » AUT boosts AI exper... | aucklandscoopconz AUT boosts AI expertise wi... | [aucklandscoopconz, aut, boost, ai, expertis, ... | [aucklandscoopconz, AUT, boosts, AI, expertise... |
| 1 | \n\nObservation, Simulation, And AI Join Force... | Observation Simulation And AI Join Forces To ... | [observ, simul, ai, join, forc, reveal, clear,... | [Observation, Simulation, And, AI, Join, Force... |
| 3 | \n\nApplitools Visual AI Reaches One Billion I... | Applitools Visual AI Reaches One Billion Imag... | [applitool, visual, ai, reach, one, billion, i... | [Applitools, Visual, AI, Reaches, One, Billion... |
| 4 | \n\nData Science and Machine-Learning Platform... | Data Science and MachineLearning Platforms Ma... | [data, scienc, machinelearn, platform, market,... | [Data, Science, and, MachineLearning, Platform... |
| 5 | \n\nHealthcare Artificial Intelligence Market ... | Healthcare Artificial Intelligence Market Ana... | [healthcar, artifici, intellig, market, analys... | [Healthcare, Artificial, Intelligence, Market,... |


## Save tokenized text

In [15]:
df.to_json('tokenized_text.json', orient='records', lines=True)
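With `orient='records'` and `lines=True`, pandas writes JSON Lines: one JSON object per row, one row per line. The sketch below mimics that layout with the standard library on two made-up records, writing to an in-memory buffer rather than to `tokenized_text.json`, to show how such a file can be read back line by line.

```python
import io
import json

# Made-up records shaped like rows of the saved DataFrame
records = [
    {"cleaned_text": "AUT boosts AI expertise",
     "topic_tokens": ["aut", "boost", "ai", "expertis"]},
    {"cleaned_text": "Data Science Platforms",
     "topic_tokens": ["data", "scienc", "platform"]},
]

# Write one JSON object per line (the JSON Lines layout)
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read it back one line at a time, skipping blank lines
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(len(loaded))                # 2
print(loaded[0]["topic_tokens"])  # ['aut', 'boost', 'ai', 'expertis']
```

In a later notebook, the file itself could presumably be reloaded with `pd.read_json('tokenized_text.json', orient='records', lines=True)`; the line-by-line approach shown here is mainly useful when the file is too large to load at once.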