# 20Newsgroups Pre-processing and Vectorization
This notebook generates assets that other Fiddler examples and notebooks can use to demonstrate and debug NLP use cases. In particular, we use the public 20Newsgroups dataset and group the original targets into more general news categories. We combine the raw text with both the original and the new targets in a pandas DataFrame and store it as a CSV file. Furthermore, we vectorize this dataset using two text embedding methods (TF-IDF and OpenAI embeddings) and store the resulting embedding vectors.

## Fetch the 20Newsgroups Dataset and Group the Labels

First, we retrieve the 20Newsgroups dataset, which ships with scikit-learn as one of its real-world datasets. The full dataset contains around 18,000 newsgroup posts on 20 topics. Here we load only the training subset and strip headers, footers, and quoted replies from each post. The original dataset is available [here](http://qwone.com/~jason/20Newsgroups/).

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import os

In [None]:
data_bunch = fetch_20newsgroups(
    subset='train',
    shuffle=True,
    random_state=1,
    remove=('headers', 'footers', 'quotes')
)

Each sample in this dataset is labeled with one of 20 topic names. You can list all the target names by running:
```
data_bunch.target_names
```
However, to keep this example notebook simple, we group similar topics into more general targets as follows:

In [None]:
subcategories = {
    
    'computer': ['comp.graphics',
                 'comp.os.ms-windows.misc',
                 'comp.sys.ibm.pc.hardware',
                 'comp.sys.mac.hardware',
                 'comp.windows.x'],
    
    'politics': ['talk.politics.guns',
                 'talk.politics.mideast',
                 'talk.politics.misc'],
    
    'recreation':['rec.autos',
                  'rec.motorcycles',
                  'rec.sport.baseball',
                  'rec.sport.hockey'],
    
    'science': ['sci.crypt',
                'sci.electronics',
                'sci.med',
                'sci.space',],
    
    'religion': ['soc.religion.christian',
                 'talk.religion.misc',
                 'alt.atheism'],
    
    'forsale':['misc.forsale']
}

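# Invert the grouping: map each original topic name to its general category.
main_category = {}
for category, topics in subcategories.items():
    for topic in topics:
        main_category[topic] = category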
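
As a sanity check on the grouping above, the sketch below inverts a small category dictionary the same way and reports any topic that is left unmapped or assigned twice. The helper names `invert_grouping` and `find_unmapped` and the toy topic lists are illustrative assumptions, not part of the notebook.

```python
def invert_grouping(groups):
    """Map each topic to its general category; fail on duplicate assignments."""
    inverted = {}
    for category, topics in groups.items():
        for topic in topics:
            if topic in inverted:
                raise ValueError(f'{topic!r} is assigned to more than one category')
            inverted[topic] = category
    return inverted


def find_unmapped(target_names, mapping):
    """Return the target names that have no general category."""
    return [name for name in target_names if name not in mapping]


# Toy data (hypothetical subset of the real topic names).
toy_groups = {
    'science': ['sci.med', 'sci.space'],
    'forsale': ['misc.forsale'],
}
toy_mapping = invert_grouping(toy_groups)
missing = find_unmapped(['sci.med', 'sci.space', 'misc.forsale', 'rec.autos'], toy_mapping)
print(toy_mapping['sci.space'])  # science
print(missing)                   # ['rec.autos']
```

In the notebook, `find_unmapped(data_bunch.target_names, main_category)` should return an empty list, since the six groups above list 20 topic names in total.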

Next, we build a DataFrame that stores both the original and the more general targets together with the text documents, and apply a few filters to the rows to make this dataset more usable.

In [None]:
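MAX_TOKEN = 4000   # upper bound on the whitespace-separated word count per document
MAX_LENGTH = 8000  # upper bound on the number of characters per document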

In [None]:
# Flatten newlines, then trim stray separator characters from both ends.
# Note: str.strip treats its argument as a set of characters, not a pattern.
data_prep = [s.replace('\n', ' ').strip('\n=|-, \\^') for s in data_bunch.data]
df = pd.DataFrame({'original_text': data_prep})
df['original_target'] = [data_bunch.target_names[t] for t in data_bunch.target]
df['target'] = df['original_target'].map(main_category)

# Drop documents that are empty after cleaning.
df['original_text'] = df['original_text'].replace('', np.nan)
df = df.dropna(axis=0, subset=['original_text'])
df = df[df['target'] != 'politics'].copy()  # delete political posts

# More filters to stay within the OpenAI token limit.
# n_tokens is a whitespace word count, used as a rough proxy for model tokens.
df['n_tokens'] = df['original_text'].apply(lambda s: len(s.split(' ')))
df = df[df['n_tokens'] < MAX_TOKEN].copy()
df['string_size'] = df['original_text'].str.len()
df = df[df['string_size'] < MAX_LENGTH]

df = df.reset_index(drop=True)
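
To make the cleaning and filtering rules concrete, here is a minimal pure-Python sketch of the same logic on three toy strings. The names `clean_text` and `within_limits`, the sample strings, and the small limits are assumptions chosen for illustration. Note that `str.strip` only trims the ends of a string, so separator characters in the middle of a post are left alone.

```python
STRIP_CHARS = '\n=|-, \\^'  # same character set as the cleaning step above


def clean_text(raw):
    """Flatten newlines, then trim separator characters from both ends."""
    return raw.replace('\n', ' ').strip(STRIP_CHARS)


def within_limits(text, max_words, max_chars):
    """True if the text is non-empty and under both size caps."""
    n_words = len(text.split(' '))
    return bool(text) and n_words < max_words and len(text) < max_chars


samples = ['--= hello\nworld ^|', '=====\n-----', 'a b c d e f']
cleaned = [clean_text(s) for s in samples]
kept = [t for t in cleaned if within_limits(t, max_words=5, max_chars=50)]
print(cleaned)  # ['hello world', '', 'a b c d e f']
print(kept)     # ['hello world']
```

The second sample consists only of separator characters, so it becomes an empty string and is dropped. The third has six words and fails the word cap.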

## OpenAI Embeddings

In [None]:
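# Note: this notebook uses the pre-1.0 interface of the `openai` Python package
# (`openai.Embedding.create`), which was removed in version 1.0.
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")  # read the key from the environment
MODEL = "text-embedding-ada-002"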

### Batch query

In [None]:
def get_openai_embedding_batch(df, text_col_name, batch_size, model=MODEL):
    """Embed df[text_col_name] in batches and return one row of embeddings per document."""
    if batch_size > 2000:
        raise ValueError('OpenAI currently does not support batches larger than 2000')
    embeddings = []
    for i in range(0, df.shape[0], batch_size):
        # iloc slicing clips at the end of the frame, so the last batch may be shorter.
        batch_df = df.iloc[i:i + batch_size]
        response = openai.Embedding.create(
            input=batch_df[text_col_name].tolist(),
            model=model
        )
        embeddings += [res['embedding'] for res in response['data']]
    embedding_col_names = ['openai_dim{}'.format(i + 1) for i in range(len(embeddings[0]))]
    return pd.DataFrame(embeddings, columns=embedding_col_names)
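
The batching loop can be exercised without an API key by swapping the network call for a stand-in. The sketch below is an assumption-laden illustration rather than the notebook's method: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` simply returns a two-dimensional vector per text so that the slicing behavior is easy to inspect.

```python
def embed_in_batches(texts, batch_size, embed_fn):
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    if batch_size < 1:
        raise ValueError('batch_size must be positive')
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]  # the last slice may be shorter
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise RuntimeError('embedding function returned the wrong number of vectors')
        embeddings.extend(vectors)
    return embeddings


# Hypothetical stand-in for the API call: one 2-d vector per text.
calls = []


def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(text)), 1.0] for text in batch]


result = embed_in_batches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], batch_size=2, embed_fn=fake_embed)
print(calls)       # [2, 2, 1]
print(result[-1])  # [5.0, 1.0]
```

Five texts with a batch size of two produce three calls, and the output keeps the input order, which is what allows the embeddings to be aligned row by row with the source DataFrame.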

In [None]:
batch_size = 2000
text_col_name = 'original_text'
openai_df = get_openai_embedding_batch(df, text_col_name, batch_size, model=MODEL)

## TF-IDF Vectorization

In [None]:
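from sklearn.feature_extraction.text import TfidfVectorizer
embedding_dimension = 300
# Keep at most 300 terms, chosen by corpus frequency, that occur in at least 1%
# and at most 90% of the documents. Tokens must contain at least one letter.
vectorizer = TfidfVectorizer(sublinear_tf=True,
                             max_features=embedding_dimension,
                             min_df=0.01,
                             max_df=0.9,
                             stop_words='english',
                             token_pattern=r'(?ui)\b\w*[a-z]+\w*\b')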

tfidf_sparse = vectorizer.fit_transform(df['original_text'])
embedding_cols = vectorizer.get_feature_names_out()
embedding_col_names = ['tfidf_token_{}'.format(t) for t in embedding_cols]
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_sparse, columns=embedding_col_names)
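
For intuition about the numbers in `tfidf_df`, the following pure-Python sketch computes the weights of a single document by hand. It assumes the scikit-learn defaults that the cell above leaves unchanged (smoothed IDF and L2 normalization) together with `sublinear_tf=True`. The function name `tfidf_row` and the toy counts are hypothetical, and the sketch expects every term count to be at least 1.

```python
import math


def tfidf_row(term_counts, doc_freq, n_docs):
    """Sublinear TF-IDF weights for one document, L2-normalized."""
    weights = {}
    for term, count in term_counts.items():
        tf = 1.0 + math.log(count)                                 # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed idf
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


# Toy corpus statistics (hypothetical): 3 documents, 'space' in all of them, 'orbit' in one.
row = tfidf_row({'space': 4, 'orbit': 1}, doc_freq={'space': 3, 'orbit': 1}, n_docs=3)
print(row)
```

Sublinear scaling dampens repeated terms: a term that occurs four times gets a term-frequency factor of `1 + ln(4)`, roughly 2.4, instead of 4.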

## Store Data

In [None]:
df.to_csv('20newsgroups_preprocessed.csv', index=False)
openai_df.to_csv('20newsgroups_openai_embeddings.csv', index=False)
tfidf_df.to_csv('20newsgroups_tfidf_embeddings.csv', index=False)