# INTRODUCTION TO NLP

## SESSION 6:

## Vectorisation of tokens and similarity of documents

Vectorization of tokens refers to the process of converting text data into a numerical format that can be used as input for machine learning algorithms. In natural language processing, vectorization is commonly used to represent each word in a document as a vector of numbers, based on its frequency, context, or other linguistic properties.

Vectorization serves several purposes: numerical representation, dimensionality reduction, document classification, and information retrieval.

In [301]:
# !pip install gensim

import gensim
import spacy
nlp=spacy.load('en_core_web_sm')

## Create a list of texts

In [302]:
doc_1='''Chat GPT is a highly popular AI-based program that people use for generating dialogues. The chatbot has a language-based model that the developer fine-tunes for human interaction in a conversational manner. 
It’s a simulated chatbot primarily designed for customer service; people use it for various other purposes. But what is it? If you are new to this Chat GPT, this guide is for you, so continue reading. 
What’s Chat GPT?
Chat GPT is an AI chatbot auto-generative system created by Open AI for online customer care. It is a pre-trained generative chat, which makes use of (NLP) Natural Language Processing. The source of its data is textbooks, websites, and various articles, which it uses to model its own language for responding to human interaction.'''

In [303]:
doc_2='''What is Chat GPT and why is everyone talking about it? On Twitter, blogs, and at the office, Chat GPT has taken over the conversation in marketing. However, not everyone is a fan.
So what is Chat GPT? Who better to ask than Chat GPT itself? 
ChatGPT is a variant of the GPT (Generative Pre-training Transformer) language model specifically designed for generating text in a chatbot-like manner. It is trained on a large dataset of human-human conversations and can generate natural language responses to input prompts.
In other words, it is a smart AI technology that will spit out factual, informative and well-written responses to given prompts. The technology presents endless potential with many applications to marketing including customer service, eCommerce, entertainment, resourcing and more! Along with these benefits, many professionals are questioning what such a helpful tool means for working freelancers and industry professionals.'''

In [304]:
doc_3='''ChatGPT is a large language learning model that was designed to imitate human conversation. It can remember things you have said to it in the past and is capable of correcting itself when wrong.
It writes in a human-like way and has a wealth of knowledge because it was trained on all sorts of text from the internet, such as Wikipedia, blog posts, books, and academic articles.
It's easy to learn how to use ChatGPT, but what is more challenging is finding out what its biggest problems are. Here are some that are worth knowing about.
1. ChatGPT Isn't Always Right
It fails at basic math, can't seem to answer simple logic questions, and will even go as far as to argue completely incorrect facts. As social media users can attest, ChatGPT can get it wrong on more than one occasion.'''

In [305]:
doc_4='''Texting, chatting and online messaging can be used for much more than simply communicating with your friends. Online communication can help young people build and develop social skills and gives them a platform to share their skills and help each other out.
Messaging and texting are among the most popular methods of communication among children and teenagers. A study by Common Sense Media in 2018 found that 70% of teenagers report using social media multiple times a day.
Messaging and texting can be much more than ways to communicate. They can also be tools that help young people learn and master important skills.'''

In [306]:
# Creating the list

docs=[doc_1,doc_2,doc_3,doc_4]

In [307]:
print(docs)

['Chat GPT is a highly popular AI-based program that people use for generating dialogues. The chatbot has a language-based model that the developer fine-tunes for human interaction in a conversational manner. \nIt’s a simulated chatbot primarily designed for customer service; people use it for various other purposes. But what is it? If you are new to this Chat GPT, this guide is for you, so continue reading. \nWhat’s Chat GPT?\nChat GPT is an AI chatbot auto-generative system created by Open AI for online customer care. It is a pre-trained generative chat, which makes use of (NLP) Natural Language Processing. The source of its data is textbooks, websites, and various articles, which it uses to model its own language for responding to human interaction.', 'What is Chat GPT and why is everyone talking about it? On Twitter, blogs, and at the office, Chat GPT has taken over the conversation in marketing. However, not everyone is a fan.\nSo what is Chat GPT? Who better to ask than Chat GPT 

## Choosing the tokens

In [308]:
texts=[]# List of all tokens
for document in docs:
    doc=nlp(document)
    text=[] # List of tokens in the document
    for token in doc:
        if not token.is_stop and not token.is_punct and not token.like_num:
            text.append(token.lemma_)
    texts.append(text)

In [309]:
# Preprocessing is done as instructed: stop words, punctuation and number-like tokens are removed
# and the remaining tokens are lemmatised; add any further preprocessing steps as required

In [310]:
print(texts)

[['Chat', 'GPT', 'highly', 'popular', 'AI', 'base', 'program', 'people', 'use', 'generating', 'dialogue', 'chatbot', 'language', 'base', 'model', 'developer', 'fine', 'tune', 'human', 'interaction', 'conversational', 'manner', '\n', 'simulated', 'chatbot', 'primarily', 'design', 'customer', 'service', 'people', 'use', 'purpose', 'new', 'Chat', 'GPT', 'guide', 'continue', 'read', '\n', 'Chat', 'GPT', '\n', 'Chat', 'GPT', 'AI', 'chatbot', 'auto', 'generative', 'system', 'create', 'Open', 'AI', 'online', 'customer', 'care', 'pre', 'train', 'generative', 'chat', 'make', 'use', 'NLP', 'Natural', 'Language', 'Processing', 'source', 'data', 'textbook', 'website', 'article', 'use', 'model', 'language', 'respond', 'human', 'interaction'], ['Chat', 'GPT', 'talk', 'Twitter', 'blog', 'office', 'Chat', 'GPT', 'take', 'conversation', 'marketing', 'fan', '\n', 'Chat', 'GPT', 'well', 'ask', 'Chat', 'GPT', '\n', 'ChatGPT', 'variant', 'GPT', 'Generative', 'pre', 'training', 'Transformer', 'language', 'm

In [312]:
# Later, convert this list into spaCy Doc objects using nlp.pipe

In [313]:
print(len(texts))

4


In [314]:
print(len(texts[0]))

76


In [315]:
print(len(texts[1]))

82


In [316]:
print(len(texts[2]))

67


In [317]:
print(len(texts[3]))

54


## Creation of a corpus

A corpus is a collection of documents; here each document is a list of tokens.

In Gensim, a `Dictionary` built from the corpus maps each unique token (a word or phrase in the text) to an integer id. Each document can then be represented numerically, for example by the frequency of each token id or by another linguistic property.
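
To make the token-to-id idea concrete, here is a minimal pure-Python sketch of how such a mapping can be built. The helper `build_token2id` and the toy documents are hypothetical, and this is not Gensim's implementation; assigning ids in sorted order within each document is an assumption chosen because it resembles the ordering seen in the `token2id` output of this notebook.

```python
def build_token2id(tokenised_docs):
    """Map every distinct token to an integer id (hypothetical helper)."""
    token2id = {}
    for tokens in tokenised_docs:
        # Visit each document's distinct tokens in sorted order (an assumption).
        for token in sorted(set(tokens)):
            if token not in token2id:
                # The next free id is the current size of the mapping.
                token2id[token] = len(token2id)
    return token2id


# Toy documents, invented for illustration only.
toy_docs = [["chat", "gpt", "chat", "ai"], ["gpt", "model", "ai", "bot"]]
toy_token2id = build_token2id(toy_docs)
print(toy_token2id)  # {'ai': 0, 'chat': 1, 'gpt': 2, 'bot': 3, 'model': 4}
```

Repeated tokens receive a single id, so the size of the mapping equals the number of unique tokens.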

In [321]:
from gensim.corpora import Dictionary # Dictionary class from the gensim.corpora module

dict_1=Dictionary(texts)
print(dict_1)

Dictionary(172 unique tokens: ['\n', 'AI', 'Chat', 'GPT', 'Language']...)


In [322]:
# The Dictionary holds the unique tokens

In [323]:
# dict_1 maps each unique token in the corpus to an integer id

## Giving an ID to each token

In [324]:
print(dict_1.token2id)

{'\n': 0, 'AI': 1, 'Chat': 2, 'GPT': 3, 'Language': 4, 'NLP': 5, 'Natural': 6, 'Open': 7, 'Processing': 8, 'article': 9, 'auto': 10, 'base': 11, 'care': 12, 'chat': 13, 'chatbot': 14, 'continue': 15, 'conversational': 16, 'create': 17, 'customer': 18, 'data': 19, 'design': 20, 'developer': 21, 'dialogue': 22, 'fine': 23, 'generating': 24, 'generative': 25, 'guide': 26, 'highly': 27, 'human': 28, 'interaction': 29, 'language': 30, 'make': 31, 'manner': 32, 'model': 33, 'new': 34, 'online': 35, 'people': 36, 'popular': 37, 'pre': 38, 'primarily': 39, 'program': 40, 'purpose': 41, 'read': 42, 'respond': 43, 'service': 44, 'simulated': 45, 'source': 46, 'system': 47, 'textbook': 48, 'train': 49, 'tune': 50, 'use': 51, 'website': 52, 'ChatGPT': 53, 'Generative': 54, 'Transformer': 55, 'Twitter': 56, 'application': 57, 'ask': 58, 'benefit': 59, 'blog': 60, 'conversation': 61, 'dataset': 62, 'eCommerce': 63, 'endless': 64, 'entertainment': 65, 'factual': 66, 'fan': 67, 'freelancer': 68, 'gene

In [325]:
# Each unique token now has its own integer id in the dictionary

In [326]:
print(len(dict_1))

172


## Bag of words

The bag-of-words model is a technique used in natural language processing and information retrieval to represent text data as a collection of words or tokens. In this model, a text is represented as a bag (multiset) of its words, disregarding grammar, word order, and context, but keeping track of their frequency.
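
The idea can be sketched in a few lines of plain Python. The helper `to_bow`, the toy vocabulary and the toy document are all hypothetical and only illustrate the representation as sorted `(token_id, count)` pairs; tokens missing from the vocabulary are simply skipped in this sketch.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy vocabulary and document, invented for illustration only.
toy_vocab = {"ai": 0, "chat": 1, "gpt": 2}
toy_doc = ["chat", "gpt", "chat", "ai", "unknown"]
toy_bow = to_bow(toy_doc, toy_vocab)
print(toy_bow)  # [(0, 1), (1, 2), (2, 1)]

# Word order is disregarded: the reversed document gives the same bag.
print(to_bow(list(reversed(toy_doc)), toy_vocab) == toy_bow)  # True
```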

In [328]:
# To make bag of words we need to first do tokenization and vectorization

In [329]:
bow_vec=[]
for text in texts:  # each text is one document's list of tokens
    bow_vec.append(dict_1.doc2bow(text))

In [330]:
print(bow_vec)

[[(0, 3), (1, 3), (2, 4), (3, 4), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 3), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 2), (29, 2), (30, 2), (31, 1), (32, 1), (33, 2), (34, 1), (35, 1), (36, 2), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 4), (52, 1)], [(0, 3), (1, 1), (2, 4), (3, 5), (14, 1), (18, 1), (20, 1), (28, 2), (30, 2), (32, 1), (33, 1), (38, 1), (44, 1), (49, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 2), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 2), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 2), (86, 1), (87, 1), (88, 2), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (9

In [331]:
# How many times the word with a given ID occurs in each document

In [347]:
# Interpretation of the first pair (0, 3): the token with ID 0 occurs 3 times

## Creating BOW Matrix

In [332]:
from gensim.matutils import corpus2dense

bow_matrix=corpus2dense(bow_vec,num_terms=len(dict_1))

In [333]:
print(bow_matrix)

[[3. 3. 4. 2.]
 [3. 1. 0. 0.]
 [4. 4. 0. 0.]
 [4. 5. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 1. 0.]
 [1. 0. 0. 0.]
 [2. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 1.]
 [3. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [2. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 1. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [2. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [2. 2. 2. 0.]
 [2. 0. 0. 0.]
 [2. 2. 1. 0.]
 [1. 0. 0. 0.]
 [1. 1. 0. 0.]
 [2. 1. 1. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 2.]
 [2. 0. 0. 2.]
 [1. 0. 0. 1.]
 [1. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 1. 1. 0.]
 [1. 0. 0. 0.]
 [4. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 1. 2. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 1. 0.]
 [0. 2. 1. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0.

In [334]:
bow_matrix.shape # 172 distinct tokens and 4 docs

(172, 4)

In [335]:
# 172 unique tokens across 4 documents

In [336]:
# First row: the token with ID 0 ('\n') occurs 3 times in the 1st doc, 3 times in the 2nd, 4 times in the 3rd and 2 times in the 4th
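
The layout of the matrix, with one row per token ID and one column per document, can be sketched in plain Python. The helper `bow_to_dense` and the toy vectors are hypothetical; this is an illustration of the term-by-document layout, not Gensim's `corpus2dense` implementation.

```python
def bow_to_dense(bow_vectors, num_terms):
    """Build a num_terms x num_docs matrix (list of rows) from sparse vectors."""
    matrix = [[0.0] * len(bow_vectors) for _ in range(num_terms)]
    for doc_index, bow in enumerate(bow_vectors):
        for token_id, count in bow:
            # Row = token id, column = document index.
            matrix[token_id][doc_index] = float(count)
    return matrix


# Two toy documents over a vocabulary of four token ids (invented values).
toy_bow_vectors = [[(0, 3), (1, 3), (2, 4)], [(0, 3), (1, 1), (3, 2)]]
toy_matrix = bow_to_dense(toy_bow_vectors, num_terms=4)
for row in toy_matrix:
    print(row)
# [3.0, 3.0]
# [3.0, 1.0]
# [4.0, 0.0]
# [0.0, 2.0]
```

Token ids absent from a document stay at zero, which is why most entries of a real term-by-document matrix are zeros.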

In [None]:
# Term frequency (TF): number of times the term occurs in a document
# Document frequency (DF): number of documents that contain the term
# IDF: inverse document frequency

## TFIDF Vectorisation

Term Frequency Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) vectorization is a technique used to represent text documents as a matrix of numbers. It is a more advanced technique than simple bag-of-words (BoW) representation, as it takes into account the importance of each word in the document and the corpus.

The basic idea behind TF-IDF vectorization is to weight the frequency of each word in the document by its importance in the corpus. The weight of each word is determined by two factors:

Term Frequency (TF): This measures the frequency of a word in a document. The more often a word appears in the document, the higher its TF value.

Inverse Document Frequency (IDF): This measures the importance of a word in the corpus. The rarer a word is in the corpus, the higher its IDF value.
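
The two factors can be combined in a short pure-Python sketch. It assumes the weight is `tf * log2(N / df)` followed by scaling each document vector to unit length, which is assumed here to mirror the default behaviour of Gensim's `TfidfModel`; the helper `tfidf_weights` and the toy vectors are hypothetical.

```python
import math


def tfidf_weights(bow_vectors):
    """Weight each (token_id, count) pair by log2(N / df), then L2-normalise."""
    num_docs = len(bow_vectors)
    doc_freq = {}
    for bow in bow_vectors:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for bow in bow_vectors:
        weighted = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
        # A token present in every document has IDF 0 and is dropped.
        weighted = [(tid, w) for tid, w in weighted if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_docs.append([(tid, w / norm) for tid, w in weighted] if norm else [])
    return weighted_docs


# Four toy documents (invented): token 0 occurs in all of them.
toy_bow_vectors = [[(0, 3), (1, 2), (2, 1)], [(0, 1), (2, 1)], [(0, 2), (3, 1)], [(0, 1), (3, 2)]]
toy_tfidf = tfidf_weights(toy_bow_vectors)
for vec in toy_tfidf:
    print(vec)
```

In the first toy document, token 1 occurs twice and appears in only one document, so it outweighs token 2, which occurs once and appears in two documents. Token 0 occurs in every document and therefore disappears from all vectors.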


In [337]:
# Vectorisation is done

In [338]:
from gensim.models import TfidfModel
tfidf=TfidfModel(bow_vec)

In [339]:
print(tfidf)

TfidfModel(num_docs=4, num_nnz=214)


In [340]:
tfidf_vec=[]
for vec in bow_vec:
    tfidf_vec.append(tfidf[vec])

In [341]:
print(tfidf_vec)

[[(1, 0.18920325824137527), (2, 0.25227101098850035), (3, 0.25227101098850035), (4, 0.12613550549425018), (5, 0.12613550549425018), (6, 0.12613550549425018), (7, 0.12613550549425018), (8, 0.12613550549425018), (9, 0.06306775274712509), (10, 0.12613550549425018), (11, 0.25227101098850035), (12, 0.12613550549425018), (13, 0.06306775274712509), (14, 0.18920325824137527), (15, 0.12613550549425018), (16, 0.12613550549425018), (17, 0.12613550549425018), (18, 0.12613550549425018), (19, 0.12613550549425018), (20, 0.026175482385303223), (21, 0.12613550549425018), (22, 0.12613550549425018), (23, 0.12613550549425018), (24, 0.12613550549425018), (25, 0.25227101098850035), (26, 0.12613550549425018), (27, 0.12613550549425018), (28, 0.05235096477060645), (29, 0.25227101098850035), (30, 0.05235096477060645), (31, 0.12613550549425018), (32, 0.06306775274712509), (33, 0.05235096477060645), (34, 0.12613550549425018), (35, 0.06306775274712509), (36, 0.12613550549425018), (37, 0.06306775274712509), (38, 0.

In [342]:
print(len(tfidf_vec))

4


print(gensim.__version__)

## Similarity of documents

After representing documents as vectors using techniques such as bag-of-words or TF-IDF, we can measure the similarity between them using distance metrics or similarity measures.

The `MatrixSimilarity` class in Gensim creates an index that can be used to perform similarity queries efficiently. It takes a corpus of documents (represented as vectors) and the number of features (which is the length of the dictionary used to create the vectors) as input.

In this example, `tfidf_vec` is the corpus of documents represented as TF-IDF vectors, and `len(dict_1)` is the number of features in the vectors (i.e., the length of the dictionary).
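
The similarity measure behind such an index is cosine similarity. A minimal sketch for two sparse vectors of `(token_id, weight)` pairs follows; the helper `cosine_similarity` and the toy vectors are hypothetical, and the sketch shows the measure itself rather than how Gensim stores or queries its index.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors of (token_id, weight) pairs."""
    weights_a, weights_b = dict(vec_a), dict(vec_b)
    # Only token ids present in both vectors contribute to the dot product.
    dot = sum(w * weights_b.get(tid, 0.0) for tid, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
toy_a = [(0, 1.0), (1, 2.0)]
toy_b = [(1, 2.0), (2, 1.0)]
toy_c = [(5, 1.0)]
print(cosine_similarity(toy_a, toy_a))  # approximately 1.0: identical documents
print(cosine_similarity(toy_a, toy_b))  # approximately 0.8: one shared token
print(cosine_similarity(toy_a, toy_c))  # 0.0: no shared tokens
```

A document compared with itself scores close to 1, and documents with no tokens in common score 0.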


In [343]:
from gensim.similarities import MatrixSimilarity

In [344]:
sim=MatrixSimilarity(tfidf_vec,num_features=len(dict_1))

In [345]:
print(sim)

MatrixSimilarity<4 docs, 172 features>


In [346]:
print(sim[tfidf_vec[0]])

[0.9999999  0.18051012 0.03002641 0.02917842]


In [348]:
# Similarity of document 1 with documents 1, 2, 3 and 4, respectively

In [38]:
print(sim[tfidf_vec[3]])

[0.02917842 0.00674681 0.02866974 0.9999999 ]


In [349]:
# Two methods of vectorisation: TF-IDF and bag of words

## SESSION 7: TOPIC MODELLING

In [350]:
# Topic modelling: unsupervised
# Topic classification: supervised

In [355]:
# Latent Dirichlet Allocation (LDA)

In [351]:
#!pip install pyLDAvis
import pyLDAvis



In [352]:
import pyLDAvis.gensim_models

In [353]:
import spacy
import gensim

In [356]:
nlp=spacy.load('en_core_web_sm')

## Accessing texts

In [357]:
text_1='''chess, one of the oldest and most popular board games, played by two opponents on a checkered board with specially designed pieces of contrasting colours, commonly white and black. White moves first, after which the players alternate turns in accordance with fixed rules, each player attempting to force the opponent’s principal piece, the King, into checkmate—a position where it is unable to avoid capture.
Chess first appeared in India about the 6th century AD and by the 10th century had spread from Asia to the Middle East and Europe. Since at least the 15th century, chess has been known as the “royal game” because of its popularity among the nobility. Rules and set design slowly evolved until both reached today’s standard in the early 19th century. Once an intellectual diversion favoured by the upper classes, chess went through an explosive growth in interest during the 20th century as professional and state-sponsored players competed for an officially recognized world championship title and increasingly lucrative tournament prizes. Organized chess tournaments, postal correspondence games, and Internet chess now attract men, women, and children around the world.
This article provides an in-depth review of the history and the theory of the game by noted author and international grandmaster Andrew Soltis.
Characteristics of the game
Chess is played on a board of 64 squares arranged in eight vertical rows called files and eight horizontal rows called ranks. These squares alternate between two colours: one light, such as white, beige, or yellow; and the other dark, such as black or green. The board is set between the two opponents so that each player has a light-coloured square at the right-hand corner.
Algebraic notation
Individual moves and entire games can be recorded using one of several forms of notation. By far the most widely used form, algebraic (or coordinate) notation, identifies each square from the point of view of the player with the light-coloured pieces, called White. The eight ranks are numbered 1 through 8 beginning with the rank closest to White. The files are labeled a through h beginning with the file at White’s left hand. Each square has a name consisting of its letter and number, such as b3 or g8. Additionally, files a through d are referred to as the queenside, and files e through h as the kingside. See Figure 1.
Get a Britannica Premium subscription and gain access to exclusive content.
Subscribe Now
Moves
The board represents a battlefield in which two armies fight to capture each other’s king. A player’s army consists of 16 pieces that begin play on the two ranks closest to that player. There are six different types of pieces: king, rook, bishop, queen, knight, and pawn; the pieces are distinguished by appearance and by how they move. The players alternate moves, White going first.
King
White’s king begins the game on e1. Black’s king is opposite at e8. Each king can move one square in any direction; e.g., White’s king can move from e1 to d1, d2, e2, f2, or f1.
Rook
Each player has two rooks (formerly also known as castles), which begin the game on the corner squares a1 and h1 for White, a8 and h8 for Black. A rook can move vertically or horizontally to any unobstructed square along the file or rank on which it is placed.
Bishop
Each player has two bishops, and they begin the game at c1 and f1 for White, c8 and f8 for Black. A bishop can move to any unobstructed square on the diagonal on which it is placed. Therefore, each player has one bishop that travels only on light-coloured squares and one bishop that travels only on dark-coloured squares.
Queen
Each player has one queen, which combines the powers of the rook and bishop and is thus the most mobile and powerful piece. The White queen begins at d1, the Black queen at d8.
Knight
Each player has two knights, and they begin the game on the squares between their rooks and bishops—i.e., at b1 and g1 for White and b8 and g8 for Black. The knight has the trickiest move, an L-shape of two steps: first one square like a rook, then one square like a bishop, but always in a direction away from the starting square. A knight at e4 could move to f2, g3, g5, f6, d6, c5, c3, or d2. The knight has the unique ability to jump over any other piece to reach its destination. It always moves to a square of a different colour.
Capturing
The king, rook, bishop, queen, and knight capture enemy pieces in the same manner that they move. For example, a White queen on d3 can capture a Black rook at h7 by moving to h7 and removing the enemy piece from the board. Pieces can capture only enemy pieces.
Pawns
Each player has eight pawns, which begin the game on the second rank closest to each player; i.e., White’s pawns start at a2, b2, c2, and so on, while Black’s pawns start at a7, b7, c7, and so on. The pawns are unique in several ways. A pawn can move only forward; it can never retreat. It moves differently than it captures. A pawn moves to the square directly ahead of it but captures on the squares diagonally in front of it; e.g., a White pawn at f5 can move to f6 but can capture only on g6 or e6. An unmoved pawn has the option of moving one or two squares forward. This is the reason for another peculiar option, called en passant—that is, in passing—available to a pawn when an enemy pawn on an adjoining file advances two squares on its initial move and could have been captured had it moved only one square. The first pawn can take the advancing pawn en passant, as if it had advanced only one square. An en passant capture must be made then or not at all. Only pawns can be captured en passant. The last unique feature of the pawn occurs if it reaches the end of a file; it must then be promoted to—that is, exchanged for—a queen, rook, bishop, or knight.
Castling
The one exception to the rule that a player may move only one piece at a time is a compound move of king and rook called castling. A player castles by shifting the king two squares in the direction of a rook, which is then placed on the square the king has crossed. For example, White can castle kingside by moving the king from e1 to g1 and the rook from h1 to f1. Castling is permitted only once in a game and is prohibited if the king or rook has previously moved or if any of the squares between them is occupied. Also, castling is not legal if the square the king starts on, crosses, or finishes on is attacked by an enemy piece.
Relative piece values
Assigning the pawn a value of 1, the values of the other pieces are approximately as follows: knight 3, bishop 3, rook 5, and queen 9. The relative values of knights and bishops vary with different pawn structures. Additionally, tactical considerations may temporarily override the pieces’ usual relative values. Material concerns are secondary to winning.
Object of the game
When a player moves a piece to a square on which it attacks the enemy king—that is, a square from which it could capture the king if the king is not shielded or moved—the king is said to be in check. The game is won when one king is in check and cannot avoid capture on the next move; this is called checkmate. A game also can end when a player, believing the situation to be hopeless, acknowledges defeat by resigning.
There are three possible results in chess: win, lose, or draw. There are six ways a draw can come about: (1) by mutual consent, (2) when neither player has enough pieces to deliver checkmate, (3) when one player can check the enemy king endlessly (perpetual check), (4) when a player who is not in check has no legal move (stalemate), (5) when an identical position occurs three times with the same player having the right to move, and (6) when no piece has been captured and no pawn has been moved within a period of 50 moves.
In competitive events, a victory is scored as one point, a draw as half a point, and a loss as no points.'''

In [358]:
text_1

'chess, one of the oldest and most popular board games, played by two opponents on a checkered board with specially designed pieces of contrasting colours, commonly white and black. White moves first, after which the players alternate turns in accordance with fixed rules, each player attempting to force the opponent’s principal piece, the King, into checkmate—a position where it is unable to avoid capture.\nChess first appeared in India about the 6th century AD and by the 10th century had spread from Asia to the Middle East and Europe. Since at least the 15th century, chess has been known as the “royal game” because of its popularity among the nobility. Rules and set design slowly evolved until both reached today’s standard in the early 19th century. Once an intellectual diversion favoured by the upper classes, chess went through an explosive growth in interest during the 20th century as professional and state-sponsored players competed for an officially recognized world championship t

In [359]:
text_2='''Chess computers were first able to beat strong chess players in the late 1980s. Their most famous success was the victory of Deep Blue over then World Chess Champion Garry Kasparov in 1997, but there was some controversy over whether the match conditions favored the computer.
In 2002–2003, three human–computer matches were drawn, but, whereas Deep Blue was a specialized machine, these were chess programs running on commercially available computers.
Chess programs running on commercially available desktop computers won decisive victories against human players in matches in 2005 and 2006. The second of these, against then world champion Vladimir Kramnik is (as of 2019) the last major human-computer match.
Since that time, chess programs running on commercial hardware—more recently including mobile phones—have been able to defeat even the strongest human players.
MANIAC (1956)
In 1956 MANIAC, developed at Los Alamos Scientific Laboratory, became the first computer to defeat a human in a chess-like game. Playing with the simplified Los Alamos rules, it defeated a novice in 23 moves.[1]
Mac Hack VI (1966–1968)
In 1966 MIT student Richard Greenblatt wrote the chess program Mac Hack VI using MIDAS macro assembly language on a Digital Equipment Corporation PDP-6 computer with 16K of memory. Mac Hack VI evaluated 10 positions per second.
In 1967, several MIT students and professors (organized by Seymour Papert) challenged Dr. Hubert Dreyfus to play a game of chess against Mac Hack VI. Dreyfus, a professor of philosophy at MIT, wrote the book What Computers Can’t Do, questioning the computer's ability to serve as a model for the human brain. He also asserted that no computer program could defeat even a 10-year-old child at chess. Dreyfus accepted the challenge. Herbert A. Simon, an artificial intelligence pioneer, watched the game. He said, "it was a wonderful game—a real cliffhanger between two woodpushers with bursts of insights and fiendish plans ... great moments of drama and disaster that go in such games." The computer was beating Dreyfus when he found a move which could have captured the enemy queen. The only way the computer could get out of this was to keep Dreyfus in checks with its own queen until it could fork the queen and king, and then exchange them. That is what the computer did. Soon, Dreyfus was losing. Finally, the computer checkmated Dreyfus in the middle of the board.
In the spring of 1967, Mac Hack VI played in the Boston Amateur championship, winning two games and drawing two games. Mac Hack VI beat a 1510 United States Chess Federation player. This was the first time a computer won a game in a human tournament. At the end of 1968, Mac Hack VI achieved a rating of 1529. The average rating in the USCF was near 1500.[2]
Chess x.x (1968–1978)
In 1968, Northwestern University students Larry Atkin, David Slate and Keith Gorlen began work on Chess (Northwestern University). On 14 April 1970 an exhibition game was played against Australian Champion Fred Flatow, the program running on a Control Data Corporation 6600 model. Flatow won easily. On 25 July 1976, Chess 4.5 scored 5–0 in the Class B (1600–1799) section of the 4th Paul Masson chess tournament in Saratoga, California. This was the first time a computer won a human tournament. Chess 4.5 was rated 1722. Chess 4.5 running on a Control Data Corporation CDC Cyber 175 supercomputer (2.1 megaflops) looked at less than 1500 positions per second. On 20 February 1977, Chess 4.5 won the 84th Minnesota Open Championship with 5 wins and 1 loss. It defeated expert Charles Fenner rated 2016. On 30 April 1978, Chess 4.6 scored 5–0 at the Twin Cities Open in Minneapolis. Chess 4.6 was rated 2040.[3] International Master Edward Lasker stated that year, "My contention that computers cannot play like a master, I retract. They play absolutely alarmingly. I know, because I have lost games to 4.7."[4]
David Levy's bet (1978)
Main article: David Levy (chess player) § Computer chess bet
For a long time in the 1970s and 1980s, it remained an open question whether any chess program would ever be able to defeat the expertise of top humans. In 1968, International Master David Levy made a famous bet that no chess computer would be able to beat him within ten years. He won his bet in 1978 by beating Chess 4.7 (the strongest computer at the time).
Cray Blitz (1981)
In 1981, Cray Blitz scored 5–0 in the Mississippi State Championship. In round 4, it defeated Joe Sentef (2262) to become the first computer to beat a master in tournament play and the first computer to gain a master rating (2258).[5]
HiTech (1988)
In 1988, HiTech won the Pennsylvania State Chess Championship with a score of 4½–½. HiTech defeated International Master Ed Formanek (2485).[6]
The Harvard Cup Man versus Computer Chess Challenge was organized by Harvard University. There were six challenges from 1989 until 1995. They played in Boston and New York City. In each challenge the humans scored higher and the highest scorer was a human'''

In [360]:
text_2

'Chess computers were first able to beat strong chess players in the late 1980s. Their most famous success was the victory of Deep Blue over then World Chess Champion Garry Kasparov in 1997, but there was some controversy over whether the match conditions favored the computer.\nIn 2002–2003, three human–computer matches were drawn, but, whereas Deep Blue was a specialized machine, these were chess programs running on commercially available computers.\nChess programs running on commercially available desktop computers won decisive victories against human players in matches in 2005 and 2006. The second of these, against then world champion Vladimir Kramnik is (as of 2019) the last major human-computer match.\nSince that time, chess programs running on commercial hardware—more recently including mobile phones—have been able to defeat even the strongest human players.\nMANIAC (1956)\nIn 1956 MANIAC, developed at Los Alamos Scientific Laboratory, became the first computer to defeat a human 

In [361]:
text_3='''"It's like a game of chess," we used to say in days gone by.
Every move our politicians made could be analysed and interpreted, not only for its significance in the wider electoral tournament but also for the possible moves, or false moves, it might induce from opponents.
That all seems rather quaint now, viewed from our present standpoint where the table on which the chess board so precariously sits is being shaken by a constant bombardment of violent impacts: the very survival of the UK, the war on Islamic State and our future in or out of the EU.
No wonder the chess pieces are wobbling already and could soon be simply tossed up in the air to land who knows where.
And now, of course, there is a new player in the game. UKIP thinks the contest has for too long been the preserve of the same exclusive club of elite players.
"This is an unpredictable election," Ed Miliband told me in what could well go down as the understatement of last week in Manchester.
There really are so many imponderables piled one on top of another that assessing the likely outcomes next May looks less and less like a chess game and more and more like a mug's game.
For a start, this era of coalition government means that with the two biggest parties short of a Commons majority, they both have rival sets of target seats on the go at the same time. So they are both in the position of having to fight defensive and offensive campaigns simultaneously.
Our list of marginal seats here in the West Midlands shows that behind their ultra-close knife-edge marginals, Labour have some other narrow majorities to worry about before they start taking the Conservatives' chessmen off the board.
These numbers underline the extend to which the Midlands has traditionally been a predominantly two-party contest.
Liberal Democrat Lorely Burt stunned the Conservatives when she "crept in under the radar" in 2005 but now she has her work cut out to defend a majority of just 175 over the Conservatives.
How will the emergence of the Green Party, now the official opposition on Solihull Council, affect the chances of the larger parties?
Lorely Burt's party colleague John Hemming has turned Birmingham Yardley into something of a personal fiefdom, but Labour will be fighting hard to overturn his 3,002 majority and regain the seat he captured from former Education Secretary Estelle Morris in 2005.
The Liberal Democrats' only other Midlands seat is in Cheltenham where Martin Horwood will defend a more comfortable majority of 4,920 over the Conservatives in what is still the home of the Midlands' only Liberal Democrat-controlled council.
With confident predictions that the Liberal Democrats will lose many seats, perhaps it is to their traditional "core" constituencies that they may have to turn as they fight to limit the damage.
And into this otherwise two-way political street comes the new kid on the blog, UKIP, arguably the biggest imponderable of all.
If, as Nigel Farage says, they are not a repository simply for disillusioned Tory votes but a genuine mass party with broad appeal with their "tanks on Labour's lawn", how might they upset the two-party chess board?
Take Dudley North, for example, where Labour's wafer-thin majority over the Conservatives faces the additional challenge of the UKIP candidate Bill Etheridge MEP, who has turned his borough into a local power base.
'Politically toxic'
On the other hand, how will the Conservatives' tiny majority over Labour in Warwickshire North fare against not just an experienced Labour candidate, former minister Mike O'Brien, but also UKIP who have made great play of opposition to high-speed rail in an area where HS2 has become, to mix my analogies, politically toxic?
Or could the Midlands UKIP vote shrink, as it did in 2010 to just 4% after a performance in the previous summer's European elections which was impressive but not as emphatic as their clear victory in this year's EU poll?
The evidence of the local elections on the same day was that if you simply weigh the votes, it is the Conservatives who lose the most. But if you apply those numbers to real council areas (and Parliamentary constituencies?} you see it was Labour who suffered last May, failing to win majorities in target councils like Tamworth, Walsall and Worcester.
But in the remaining eight months before polling, perhaps the greatest imponderable of all is what politicians call "events" in a region which has always been particularly prone to the ups and downs of the economy.
The Birmingham and Solihull Chamber of Commerce has just reported a continuing surge in business confidence.
"Try telling that to young people in my constituency" says Ian Austin, the Labour MP for that key seat of Dudley North, where unemployment remains well above the national average and where wages continue to lag behind prices.
Put all these chess pieces together and you can see why even those politics watchers with the longest memories say they have never known an election as difficult to predict as this.
It helps to explain why the mood clearly detectable in Labour's ranks last week in Manchester was more uncertain than hopeful; why the Conservatives, behind Labour in the polls for so long, nevertheless closed their conference in Birmingham with more than a sneaking feeling that they could confound the sooth-sayers; and why I have it on good authority that senior Liberal Democrats are preparing to embark on their conference in Glasgow in a mood of innermost trepidation.
As the end game draws near, a game of chess has never looked more like a game of chance.....'''

In [362]:
text_3

'"It\'s like a game of chess," we used to say in days gone by.\nEvery move our politicians made could be analysed and interpreted, not only for its significance in the wider electoral tournament but also for the possible moves, or false moves, it might induce from opponents.\nThat all seems rather quaint now, viewed from our present standpoint where the table on which the chess board so precariously sits is being shaken by a constant bombardment of violent impacts: the very survival of the UK, the war on Islamic State and our future in or out of the EU.\nNo wonder the chess pieces are wobbling already and could soon be simply tossed up in the air to land who knows where.\nAnd now, of course, there is a new player in the game. UKIP thinks the contest has for too long been the preserve of the same exclusive club of elite players.\n"This is an unpredictable election," Ed Miliband told me in what could well go down as the understatement of last week in Manchester.\nThere really are so many

In [363]:
# Create a list of texts

texts=[text_1,text_2,text_3]

In [364]:
print(texts)

['chess, one of the oldest and most popular board games, played by two opponents on a checkered board with specially designed pieces of contrasting colours, commonly white and black. White moves first, after which the players alternate turns in accordance with fixed rules, each player attempting to force the opponent’s principal piece, the King, into checkmate—a position where it is unable to avoid capture.\nChess first appeared in India about the 6th century AD and by the 10th century had spread from Asia to the Middle East and Europe. Since at least the 15th century, chess has been known as the “royal game” because of its popularity among the nobility. Rules and set design slowly evolved until both reached today’s standard in the early 19th century. Once an intellectual diversion favoured by the upper classes, chess went through an explosive growth in interest during the 20th century as professional and state-sponsored players competed for an officially recognized world championship 

## Creating a word list

In [365]:
# Build one list of content-word lemmas per text
words_list=[]
for text in texts:
    doc=nlp(text)
    text_words=[]
    for token in doc:
        # Skip stop words, punctuation, number-like tokens and bare newlines
        if not (token.is_stop or token.is_punct or token.like_num) and token.text!='\n':
            text_words.append(token.lemma_)
    words_list.append(text_words)

In [366]:
print(words_list)

[['chess', 'old', 'popular', 'board', 'game', 'play', 'opponent', 'checkered', 'board', 'specially', 'design', 'piece', 'contrast', 'colour', 'commonly', 'white', 'black', 'white', 'move', 'player', 'alternate', 'turn', 'accordance', 'fix', 'rule', 'player', 'attempt', 'force', 'opponent', 'principal', 'piece', 'King', 'checkmate', 'position', 'unable', 'avoid', 'capture', 'Chess', 'appear', 'India', 'century', 'ad', 'century', 'spread', 'Asia', 'Middle', 'East', 'Europe', 'century', 'chess', 'know', 'royal', 'game', 'popularity', 'nobility', 'rule', 'set', 'design', 'slowly', 'evolve', 'reach', 'today', 'standard', 'early', 'century', 'intellectual', 'diversion', 'favour', 'upper', 'class', 'chess', 'go', 'explosive', 'growth', 'interest', 'century', 'professional', 'state', 'sponsor', 'player', 'compete', 'officially', 'recognize', 'world', 'championship', 'title', 'increasingly', 'lucrative', 'tournament', 'prize', 'organized', 'chess', 'tournament', 'postal', 'correspondence', 'gam

In [367]:
print(len(words_list))

3


In [368]:
print(len(words_list[0]))

686


In [369]:
print(len(words_list[1]))

468


In [370]:
print(len(words_list[2]))

432
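The printed `words_list` keeps lemmas case-sensitive, so entries such as 'chess' and 'Chess' are counted as different words further down the pipeline. The sketch below uses a handful of lemmas that appear in that output to show the effect of lower-casing before counting. It is an optional variation rather than what the notebook does, and it would also merge proper nouns such as 'King' with 'king'.

```python
from collections import Counter

# A handful of lemmas that appear in the printed words_list output
sample_lemmas = ['chess', 'board', 'game', 'King', 'Chess', 'chess', 'game']

# Counting as-is treats 'chess' and 'Chess' as two different words
case_sensitive = Counter(sample_lemmas)

# Lower-casing first merges the two spellings into one entry
case_folded = Counter(lemma.lower() for lemma in sample_lemmas)

print(case_sensitive['chess'], case_sensitive['Chess'])
print(case_folded['chess'])
```

Whether to lower-case depends on the task: it gives cleaner counts, but discards the distinction between common and proper nouns.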


## Creating a corpus

In [371]:
from gensim.corpora import Dictionary

corpus=[]  # will hold one bag-of-words vector per text

In [372]:
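# Map every unique lemma to an integer id
# (note: the name `dict` shadows Python's built-in dict; later cells rely on it)
dict=Dictionary(words_list)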
type(dict)

gensim.corpora.dictionary.Dictionary

In [373]:
# Convert each text's lemma list into sparse (token_id, count) pairs
for text_words in words_list:
    corpus.append(dict.doc2bow(text_words))

In [374]:
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 6), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 13), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 2), (33, 1), (34, 2), (35, 1), (36, 1), (37, 2), (38, 3), (39, 1), (40, 1), (41, 1), (42, 2), (43, 1), (44, 1), (45, 1), (46, 2), (47, 1), (48, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 8), (60, 1), (61, 1), (62, 1), (63, 12), (64, 3), (65, 6), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 6), (73, 14), (74, 5), (75, 1), (76, 5), (77, 1), (78, 1), (79, 5), (80, 1), (81, 3), (82, 6), (83, 1), (84, 1), (85, 3), (86, 6), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2), (98, 1), (99, 1), (100, 1), (101, 2), (102, 1), (103, 1), (104, 1), (105, 1), (106, 2), (107, 2), (108, 1), (109, 1), (110,

In [375]:
len(corpus)

3

In [376]:
len(corpus[0])

325

In [377]:
len(corpus[1])

249

In [378]:
len(corpus[2])

304
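To make the `(id, count)` pairs in the corpus easier to read, here is a minimal pure-Python sketch of the bag-of-words idea. The helpers `build_vocab` and `to_bow` and the toy documents are illustrative assumptions. They mimic the shape of what `Dictionary` and `doc2bow` return, not gensim's implementation, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for tokens in docs:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Two toy "documents" of already-lemmatised words (hypothetical data)
toy_docs = [["chess", "board", "game", "board"],
            ["game", "player", "chess", "chess"]]

toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Each document is reduced to the distinct words it contains and how often each occurs, which is why `len(corpus[0])` (325) is smaller than `len(words_list[0])` (686).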

## Creating an LDA model

In [379]:
lda=gensim.models.ldamodel.LdaModel(corpus=corpus,
                                   num_topics=5,
                                   id2word=dict)

In [380]:
# num_topics is chosen by us up front; LDA does not infer the number of topics from the data

In [381]:
type(lda)

gensim.models.ldamodel.LdaModel

## Displaying topics

In [382]:
lda.print_topics() # returns 5 tuples because num_topics=5

[(0,
  '0.014*"chess" + 0.013*"computer" + 0.012*"game" + 0.012*"piece" + 0.010*"square" + 0.009*"player" + 0.008*"king" + 0.007*"capture" + 0.007*"Chess" + 0.007*"move"'),
 (1,
  '0.011*"game" + 0.009*"square" + 0.009*"player" + 0.008*"chess" + 0.008*"Labour" + 0.008*"move" + 0.007*"piece" + 0.006*"capture" + 0.006*"pawn" + 0.006*"king"'),
 (2,
  '0.016*"player" + 0.014*"game" + 0.012*"chess" + 0.011*"square" + 0.011*"king" + 0.010*"computer" + 0.010*"piece" + 0.008*"Chess" + 0.008*"capture" + 0.008*"win"'),
 (3,
  '0.015*"chess" + 0.014*"computer" + 0.011*"game" + 0.008*"human" + 0.008*"Chess" + 0.007*"play" + 0.007*"win" + 0.006*"player" + 0.006*"Dreyfus" + 0.006*"Mac"'),
 (4,
  '0.023*"square" + 0.020*"player" + 0.018*"game" + 0.017*"pawn" + 0.014*"piece" + 0.013*"king" + 0.013*"move" + 0.012*"chess" + 0.012*"rook" + 0.010*"capture"')]

In [None]:
# Each weight is the probability of that word within the topic, e.g. 0.014 for "chess" in topic 0

In [None]:
# Read together, the highest-weighted words of a topic suggest the theme or concept that topic captures
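To make the weights concrete, the sketch below parses the string printed for topic 0 into (word, weight) pairs and adds the weights up. The helper name `parse_topic` is an illustrative assumption; gensim's `lda.show_topic(topic_id)` returns such pairs directly, so the string parsing here is only for demonstration.

```python
import re

# Topic 0 as printed by lda.print_topics() above
topic_0 = ('0.014*"chess" + 0.013*"computer" + 0.012*"game" + 0.012*"piece" + '
           '0.010*"square" + 0.009*"player" + 0.008*"king" + 0.007*"capture" + '
           '0.007*"Chess" + 0.007*"move"')

def parse_topic(topic_string):
    """Split a printed topic into (word, weight) pairs (hypothetical helper)."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0_terms = parse_topic(topic_0)

# Total probability covered by the ten words that are displayed
shown_mass = sum(weight for _, weight in topic_0_terms)

print(topic_0_terms[:3])
print(round(shown_mass, 3))
```

The ten displayed weights add up to roughly 0.1 rather than 1, because only the ten highest-weighted words are printed; the remaining probability is spread over the rest of the dictionary.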

In [383]:
lda.print_topics()[:2]

[(0,
  '0.014*"chess" + 0.013*"computer" + 0.012*"game" + 0.012*"piece" + 0.010*"square" + 0.009*"player" + 0.008*"king" + 0.007*"capture" + 0.007*"Chess" + 0.007*"move"'),
 (1,
  '0.011*"game" + 0.009*"square" + 0.009*"player" + 0.008*"chess" + 0.008*"Labour" + 0.008*"move" + 0.007*"piece" + 0.006*"capture" + 0.006*"pawn" + 0.006*"king"')]
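The two topics printed above share most of their top words, which hints that five topics may be more than three texts can support. One rough way to quantify the overlap is the Jaccard similarity of the two top-ten word sets, sketched below. The function name `jaccard` is an illustrative choice and not part of gensim.

```python
# Top-ten words of topics 0 and 1, copied from the output above
topic_0_words = {"chess", "computer", "game", "piece", "square",
                 "player", "king", "capture", "Chess", "move"}
topic_1_words = {"game", "square", "player", "chess", "Labour",
                 "move", "piece", "capture", "pawn", "king"}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

shared = topic_0_words & topic_1_words
overlap = jaccard(topic_0_words, topic_1_words)

print(sorted(shared))
print(round(overlap, 3))
```

A value near 1 means two topics are close to duplicates; a value near 0 means their top words are largely distinct.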

## Getting topics for a word

In [384]:
lda.get_term_topics('game')

[(0, 0.011295231), (2, 0.013283155), (3, 0.010231169), (4, 0.0173228)]

In [388]:
lda.get_term_topics('square')

[(2, 0.010346283), (4, 0.02180278)]

In [389]:
lda.get_term_topics('famous')

[]

In [392]:
# 'famous' is in the dictionary, but its weight in every topic is below the minimum probability threshold, so an empty list is returned

In [390]:
lda.get_term_topics('player')

[(2, 0.015119347), (4, 0.019275462)]
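The results above list only some topics for each word, and none at all for 'famous'. A plausible reading is that `get_term_topics` drops every topic in which the word's weight falls below a minimum probability. The sketch below illustrates that filtering step, assuming a threshold of 0.01 (which appears to be gensim's default `minimum_probability`); the function name and the per-topic weights are hypothetical.

```python
def filter_term_topics(weights_by_topic, minimum_probability=0.01):
    """Keep (topic_id, weight) pairs whose weight reaches the threshold."""
    return [(topic_id, weight)
            for topic_id, weight in enumerate(weights_by_topic)
            if weight >= minimum_probability]

# Hypothetical per-topic weights for two words in a 5-topic model
player_weights = [0.009, 0.009, 0.015, 0.006, 0.019]
famous_weights = [0.002, 0.001, 0.002, 0.003, 0.001]

player_topics = filter_term_topics(player_weights)
famous_topics = filter_term_topics(famous_weights)

print(player_topics)
print(famous_topics)
```

Under this assumption, passing a smaller `minimum_probability` to `get_term_topics` should surface topics for low-weight words as well.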

## Visualisation of topics

In [393]:
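import pyLDAvis
import pyLDAvis.gensim_models

# Enable interactive visualisations inside the notebook
pyLDAvis.enable_notebook()

In [394]:
# Prepare the visualisation data from the fitted model, the corpus and the dictionary
plot=pyLDAvis.gensim_models.prepare(lda,
                                    corpus=corpus,
                                    dictionary=lda.id2word)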

In [395]:
plot

In [None]:
# Blue bars show a word's overall frequency in the corpus; red bars show its estimated frequency within the selected topic

In [None]:
# Topic modelling is most meaningful when the documents being analysed share a common broad subject

The relevance metric λ in the pyLDAvis view controls how the terms shown for a selected topic are ranked. It balances a term's probability within that topic against its lift, i.e. the ratio of that probability to the term's overall probability across the corpus. A value of λ = 1 ranks terms purely by their probability within the topic, while a value of λ = 0 ranks them purely by lift, which favours terms that are rare in the corpus as a whole but concentrated in the topic. A value between 0 and 1 allows for a balance between the two.
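A minimal sketch of the relevance score used by pyLDAvis, λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)), where p(w|t) is a term's probability within the topic and p(w) its overall probability in the corpus. The two terms and their probabilities below are hypothetical values chosen only to show how the ranking flips as λ changes.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical (p(w|t), p(w)) pairs for two terms in one topic:
# "game" is frequent everywhere, "rook" is rarer but concentrated in the topic
terms = {"game": (0.018, 0.015), "rook": (0.012, 0.003)}

def rank_terms(lam):
    """Order the terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print(rank_terms(1.0))  # ranking by probability within the topic
print(rank_terms(0.0))  # ranking by lift
```

Moving the slider towards 0 in the visualisation therefore tends to promote terms that are distinctive for the topic over terms that are merely common.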

#### Global distribution

In topic modeling, the global distribution refers to the distribution of topics across the entire corpus of documents being analyzed. It represents the frequency of each topic in the overall collection of documents, and is used in conjunction with the document-specific distribution of topics to estimate the probability of a given word being associated with a given topic. The global distribution is important for providing context and helping to identify which topics are most prominent across the entire corpus.

#### Document-specific distribution

In topic modeling, the document-specific distribution refers to the distribution of topics within a particular document. It represents the probability of each topic being present in that specific document, and is used in conjunction with the global distribution of topics to estimate the probability of a given word being associated with a given topic. The document-specific distribution is important because it captures the unique topics and themes present within a specific document, allowing for more precise and nuanced modeling of the underlying topics within the corpus.
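The relationship between the two distributions can be sketched in a few lines. In gensim, `lda.get_document_topics(bow)` returns the document-specific distribution for one bag-of-words vector. The topic proportions below are hypothetical stand-ins for such output, while the document lengths are the three `words_list` lengths printed earlier. One common way to obtain a corpus-level distribution, assumed here, is to average the per-document distributions weighted by document length.

```python
# Hypothetical per-document topic distributions (each row sums to 1)
doc_topics = [
    [0.70, 0.05, 0.15, 0.05, 0.05],
    [0.10, 0.05, 0.10, 0.70, 0.05],
    [0.05, 0.80, 0.05, 0.05, 0.05],
]

# Number of retained lemmas per text, as printed for words_list
doc_lengths = [686, 468, 432]

total_length = sum(doc_lengths)
num_topics = len(doc_topics[0])

# Length-weighted average of the document-specific distributions
global_topics = [
    sum(theta[k] * n for theta, n in zip(doc_topics, doc_lengths)) / total_length
    for k in range(num_topics)
]

print([round(p, 3) for p in global_topics])
```

Longer documents pull the global distribution towards their own dominant topics, which is why the first topic ends up most prominent in this toy example.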