<a href="https://colab.research.google.com/github/python-for-data-analytic/data-science-in-economics/blob/master/005_text_mining_text_network_and_word_cloud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Mining - Text Network & Word Cloud

## Text Network Analysis

Though network analysis is most often used to describe relationships between people, some of the early pioneers of network analysis realized that it could also be applied to represent relationships between words. For example, one can represent a corpus of documents as a network where each node is a document, and the thickness or strength of the edges between them describes similarities between the words used in any two documents. Or, one can create a text network where individual words are the nodes, and the edges between them describe the regularity with which they co-occur in documents.
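As a rough illustration of the word-level representation, the sketch below counts how many documents each pair of words shares and treats that count as an edge weight. The three-sentence corpus and the name `edge_weights` are made up for this example and are not part of the dataset used later.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical, for illustration only)
docs = [
    "cash ban hits small traders",
    "small traders queue for cash",
    "banks queue for new notes",
]

# Count in how many documents each unordered word pair appears together
edge_weights = Counter()
for doc in docs:
    words = sorted(set(doc.split()))  # unique words, in a stable order
    edge_weights.update(combinations(words, 2))

# Each key is an edge (word_a, word_b); each value is its weight
for (word_a, word_b), weight in edge_weights.most_common(4):
    print(word_a, word_b, weight)
```

Pairs that appear together in more documents get heavier edges, which is the "regularity of co-occurrence" described above.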

There are multiple advantages to a network-based approach to automated text analysis. First, just as clusters of social connections can help explain a range of outcomes, patterns of connections between words help identify their meaning more precisely. Second, text networks can be built out of documents of any length, whereas topic models function poorly on short texts such as social media messages.

In this practice we will use NetworkX, a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. The full reference is available in the NetworkX documentation.

Here we construct a text network based on conversations about 'Demonetization in India'.

**Install & Import Libraries**

In [None]:
# Import Libraries
import numpy as np
import nltk
import itertools
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
from nltk import bigrams
from nltk.tokenize import word_tokenize
from random import seed

nltk.download('punkt')

**Import Data**

In [None]:
# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/text_preprocessed_short.csv', sep = ';')

# Show Data
df

In [13]:
# Convert to String
df['text']=df['text'].fillna('').apply(str)

In [None]:
# Select Text
text = df['text']
text

### **Preparing Adjacency Matrix**

In [None]:
# Tokenize
text_data = [word_tokenize(i) for i in text]
print(text_data)

In [None]:
# Create function to build the co-occurrence matrix
def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}
 
    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))
 
    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
 
    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
 
    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
 
    # return the matrix and the index
    return co_occurrence_matrix, vocab_index
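To make the orientation of the matrix concrete, here is a minimal standard-library sketch of the same idea on a five-token toy input. The token list and the names `pair_counts`, `index`, and `matrix` are hypothetical; the function above uses `nltk.bigrams` and NumPy instead.

```python
from collections import Counter

tokens = ["cash", "ban", "hits", "cash", "ban"]  # hypothetical toy input

# Adjacent pairs (previous, current), the same kind of pairs a bigram pass yields
pair_counts = Counter(zip(tokens, tokens[1:]))

vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}

# matrix[current][previous] = count, mirroring the function above
matrix = [[0] * len(vocab) for _ in vocab]
for (previous, current), count in pair_counts.items():
    matrix[index[current]][index[previous]] = count

for word, row in zip(vocab, matrix):
    print(word, row)
```

Because rows are indexed by the current word and columns by the previous word, the matrix is generally not symmetric. `nx.from_pandas_adjacency` builds an undirected graph by default, so direction is dropped unless `create_using=nx.DiGraph` is passed.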

In [None]:
# Create one list using many lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)
 
 
data_matrix = pd.DataFrame(matrix, index=vocab_index,
                             columns=vocab_index)

# Show Adjacency Matrix
data_matrix.head()

In [None]:
data_matrix.info()

### **Constructing Text Network**

In [None]:
# Construct a Network
G = nx.from_pandas_adjacency(data_matrix)

# Visualize the Network
import matplotlib.pyplot as plt
plt.figure(figsize=(50,40))
nx.draw(G, with_labels=True, 
        node_color='skyblue', node_size=600, 
        arrowstyle='->',arrowsize=20, edge_color='r',
        font_size=7,
        pos=nx.kamada_kawai_layout(G))

### **Network Metrics and Measurement**

**Centrality Measurement**

In graph theory and network analysis, indicators of centrality identify the most important vertices within a graph. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, and super-spreaders of disease. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin.
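As a rough guide to what two of these measures compute, the sketch below works out degree and closeness centrality by hand on a four-word toy graph. The graph and all names in it are invented for illustration, and the closeness formula shown assumes a connected graph.

```python
from collections import deque

# Hypothetical toy word graph as an undirected adjacency dict
graph = {
    "cash": {"ban", "queue", "bank"},
    "ban": {"cash"},
    "queue": {"cash", "bank"},
    "bank": {"cash", "queue"},
}

n = len(graph)

# Degree centrality: share of the other n - 1 nodes that a node touches
degree_centrality = {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness(source):
    """Closeness: reachable nodes divided by the sum of shortest-path lengths."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return (len(dist) - 1) / sum(dist.values())


closeness_centrality = {node: closeness(node) for node in graph}
print(degree_centrality)
print(closeness_centrality)
```

A word that is directly linked to every other word scores highest on both measures, while a word hanging off a single neighbour scores low.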

In [None]:
# Degree Centrality
degree = nx.degree_centrality(G)

# Sorted from the Highest (uses the centrality scores, not raw degree counts)
sorted(degree.items(), key=lambda x: x[1], reverse=True)[0:10]

In [None]:
# Betweenness Centrality
betweenness = nx.betweenness_centrality(G)

# Sorted from the Highest
sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[0:10]

In [None]:
# Closeness Centrality
closeness = nx.closeness_centrality(G)

# Sorted from the Highest
sorted(closeness.items(), key=lambda x: x[1], reverse=True)[0:10]

In [None]:
# Eigenvector Centrality
eigenvector = nx.eigenvector_centrality(G)

# Sorted from the Highest
sorted(eigenvector.items(), key=lambda x: x[1], reverse=True)[0:10]

***Visualize Network based on Centrality Measurement***

In [None]:
# Set Degree Dictionary
d = dict(degree)

# Visualize the Network
import matplotlib.pyplot as plt
plt.figure(figsize=(50,40))
nx.draw(G, with_labels=True, 
        node_color='skyblue', nodelist=d.keys(),
        node_size=[v * 50000 for v in d.values()], 
        arrowstyle='->',arrowsize=20, edge_color='r',
        font_size=8,
        pos=nx.kamada_kawai_layout(G))

**Network Topology Measurement**

The configuration, or topology, of a network is key to determining its performance. Network topology is the way a network is arranged, including the physical or logical description of how links and nodes are set up to relate to each other.

In [None]:
# Show Number of Nodes
nx.number_of_nodes(G)

In [None]:
# Show Number of Edges
nx.number_of_edges(G)

In [None]:
# Show Graph Density
nx.density(G)
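For an undirected graph, density is the number of observed edges divided by the number of possible edges, i.e. 2m / (n(n - 1)) for n nodes and m edges. A small hand calculation on a made-up edge list (the words and variable names are illustrative only):

```python
# Hypothetical toy edge list for an undirected word graph
edges = [("cash", "ban"), ("cash", "queue"), ("cash", "bank"), ("queue", "bank")]

nodes = {word for edge in edges for word in edge}
n, m = len(nodes), len(edges)

# Density of an undirected graph: observed edges / possible edges
density = 2 * m / (n * (n - 1))
print(n, m, density)
```

Text networks built from large vocabularies are usually sparse, so a density close to 0 is to be expected.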

### **Community Detection**

Community detection divides the words of a text network (the nodes of the graph) into groups that are densely connected internally and only sparsely connected to the rest of the network.

**Modularity Community**

In [None]:
# Import Module
from networkx.algorithms.community import greedy_modularity_communities

# Modularity Community Detection
communities_m = sorted(greedy_modularity_communities(G), key=len, reverse=True)
communities_m
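`greedy_modularity_communities` searches for a partition with high modularity. As a sketch of what that score measures, the function below computes modularity for an unweighted, undirected graph directly from its definition: for each community, the fraction of edges inside it minus the fraction expected from node degrees alone. The toy graph and the helper name `modularity` are assumptions made for this example.

```python
# Two triangles joined by one bridge edge (hypothetical toy graph)
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
partition = [{"a", "b", "c"}, {"d", "e", "f"}]


def modularity(edges, partition):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    q = 0.0
    for community in partition:
        # Edges with both endpoints inside the community
        internal = sum(1 for u, v in edges if u in community and v in community)
        # Sum of degrees of the community's nodes
        degree_sum = sum((u in community) + (v in community) for u, v in edges)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


print(modularity(edges, partition))
print(modularity(edges, [{"a", "b", "c", "d", "e", "f"}]))
```

Splitting the graph at the bridge gives a positive score, while putting every node in a single community gives zero, which is why the two-triangle split counts as the better partition.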

In [None]:
# Set Node Community Function
def set_node_community(G, communities_m):
    '''Add community to node attributes'''
    for c, v_c in enumerate(communities_m):
        for v in v_c:
            # Add 1 to save 0 for external edges
            G.nodes[v]['community'] = c + 1

In [None]:
# Set Colour Function
def get_color(i, r_off=1, g_off=1, b_off=1):
    '''Assign a color to a vertex.'''
    n = 16
    low, high = 0.1, 0.9
    span = high - low
    r = low + span * (((i + r_off) * 3) % n) / (n - 1)
    g = low + span * (((i + g_off) * 5) % n) / (n - 1)
    b = low + span * (((i + b_off) * 7) % n) / (n - 1)
    return (r, g, b)

In [None]:
# Set Node Communities (the function modifies G in place and returns nothing)
set_node_community(G, communities_m)

# Set Node Color
node_color = [get_color(G.nodes[v]['community']) for v in G.nodes]

# Visualize the Network
import matplotlib.pyplot as plt
plt.figure(figsize=(50,40))
# node_color already holds RGB tuples, so no colormap argument is needed
nx.draw(G, with_labels=True,
        node_color=node_color, node_size=600,
        arrowstyle='->', arrowsize=20, edge_color='r',
        font_size=7,
        pos=nx.kamada_kawai_layout(G))

## Word Cloud

In [2]:
import wordcloud
import matplotlib.pyplot as plt

In [3]:
from wordcloud import WordCloud
from wordcloud import STOPWORDS

In [None]:
# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/dianrdn/data/master/text_preprocessed_short.csv', sep = ';')

# Show Data
df

In [15]:
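# WordCloud.generate expects a single string, so join all documents first
all_text = ' '.join(df['text'].fillna('').astype(str))

cloud = WordCloud().generate(all_text)
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()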

In [18]:
!pip install wikipedia

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-cp36-none-any.whl size=11686 sha256=3e2d4369f93816fbd15cce1adc7145f854b43ec30b16e8f1e8e0d8a93667c319
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [19]:
import wikipedia
page = wikipedia.page("Natural Language Processing")
text1 = page.content

In [20]:
text1

'Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.\nChallenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\n\n\n== History ==\nThe history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.\nIn 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\nThe Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved