## NLP Topic Modeling Exercise

In [1]:
# import TfidfVectorizer and CountVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# import fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

# import NMF and LatentDirichletAllocation from sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

* create a variable called `'no_features'` and set its value to 100.

In [4]:
no_features = 100

* create a variable `'no_topics'` and set its value to 100

In [5]:
no_topics = 100

## NMF

* instantiate a TfidfVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [6]:
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

* use the `fit_transform` method of TfidfVectorizer to transform the documents

In [7]:
X = vectorizer.fit_transform(documents)

* get the feature names from the fitted TfidfVectorizer

In [8]:
vectorizer.get_feature_names_out()

array(['00', '10', '12', '14', '15', '16', '20', '25', 'a86', 'available',
       'ax', 'b8f', 'believe', 'best', 'better', 'bit', 'case', 'com',
       'come', 'course', 'data', 'day', 'did', 'didn', 'different',
       'does', 'doesn', 'don', 'drive', 'edu', 'fact', 'far', 'file',
       'g9v', 'god', 'going', 'good', 'got', 'government', 'help',
       'information', 'jesus', 'just', 'key', 'know', 'law', 'let',
       'like', 'line', 'list', 'little', 'll', 'long', 'look', 'lot',
       'mail', 'make', 'max', 'mr', 'need', 'new', 'number', 'people',
       'point', 'power', 'probably', 'problem', 'program', 'question',
       'read', 'really', 'right', 'run', 'said', 'say', 'second', 'set',
       'software', 'space', 'state', 'sure', 'tell', 'thanks', 'thing',
       'things', 'think', 'time', 'true', 'try', 'use', 'used', 'using',
       've', 'want', 'way', 'windows', 'work', 'world', 'year', 'years'],
      dtype=object)

* instantiate NMF and fit it to the transformed data

In [10]:
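# pass no_topics so the variable defined above actually sets the number of topics
nmf = NMF(n_components=no_topics)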

In [12]:
nmf_X = nmf.fit_transform(X)



In [13]:
nmf_X

array([[0.0186895 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.04587933, 0.        ,
        0.02482667],
       [0.        , 0.        , 0.        , ..., 0.        , 0.07081967,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01037582, 0.        , 0.        , ..., 0.        , 0.03395286,
        0.        ]])

## LDA w/ Sklearn

* instantiate a CountVectorizer with the following parameters:


    * max_df = 0.95
    * min_df = 2
    * max_features = no_features
    * stop_words = 'english'

In [14]:
cvec = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')

* use the `fit_transform` method of CountVectorizer to transform the documents

In [16]:
X_cvec = cvec.fit_transform(documents)

* get the feature names from the fitted CountVectorizer

In [18]:
cvec.get_feature_names_out()

array(['00', '10', '12', '14', '15', '16', '20', '25', 'a86', 'available',
       'ax', 'b8f', 'believe', 'best', 'better', 'bit', 'case', 'com',
       'come', 'course', 'data', 'day', 'did', 'didn', 'different',
       'does', 'doesn', 'don', 'drive', 'edu', 'fact', 'far', 'file',
       'g9v', 'god', 'going', 'good', 'got', 'government', 'help',
       'information', 'jesus', 'just', 'key', 'know', 'law', 'let',
       'like', 'line', 'list', 'little', 'll', 'long', 'look', 'lot',
       'mail', 'make', 'max', 'mr', 'need', 'new', 'number', 'people',
       'point', 'power', 'probably', 'problem', 'program', 'question',
       'read', 'really', 'right', 'run', 'said', 'say', 'second', 'set',
       'software', 'space', 'state', 'sure', 'tell', 'thanks', 'thing',
       'things', 'think', 'time', 'true', 'try', 'use', 'used', 'using',
       've', 'want', 'way', 'windows', 'work', 'world', 'year', 'years'],
      dtype=object)

* instantiate LatentDirichletAllocation and fit it to the transformed data

* create a function `display_topics` that is able to display the top words in a topic for different models
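
A minimal sketch of such a function, assuming the model exposes a `components_` array of shape (topics, features), which both NMF and LatentDirichletAllocation do once fitted. The helper `top_words` and the stub objects `stub_model` and `stub_features` are hypothetical and exist only to make the example self-contained.

```python
from types import SimpleNamespace

import numpy as np


def top_words(model, feature_names, no_top_words):
    """Return the highest-weighted words for each row of model.components_."""
    topics = []
    for topic_weights in model.components_:
        # argsort is ascending, so reverse it and keep the first no_top_words indices.
        top_indices = np.argsort(topic_weights)[::-1][:no_top_words]
        topics.append([str(feature_names[i]) for i in top_indices])
    return topics


def display_topics(model, feature_names, no_top_words):
    """Print one line per topic listing its top words."""
    for topic_idx, words in enumerate(top_words(model, feature_names, no_top_words)):
        print(f"Topic {topic_idx}: {' '.join(words)}")


# Hypothetical stand-in for a fitted model: two topics over a four-word vocabulary.
stub_model = SimpleNamespace(
    components_=np.array([[0.1, 0.9, 0.0, 0.5],
                          [0.7, 0.0, 0.2, 0.1]])
)
stub_features = np.array(["drive", "god", "key", "jesus"])

display_topics(stub_model, stub_features, 2)
```

With the NMF model fitted earlier, the call would look like `display_topics(nmf, vectorizer.get_feature_names_out(), 10)`; the LDA call is the same with the fitted LDA model and `cvec.get_feature_names_out()`.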

* display top 10 words from each topic from NMF model

* display top 10 words from each topic from LDA model

### Stretch: Use LDA w/ Gensim to do the same thing.