# Text visualization techniques
## Introduction

<img src="images/library.png" style="width: 1000px;">

- most of the informations in the world is stored within lines of a book or a paper.
- library of congress has over 155.3 items which is a huge number for any human being to process if a lifetime.
- It would take 500 years reading 10 items/day for a human to read all the congress library, not to mention processing and visualizing.
- Text Mining is becoming a powerful field due the computational power of computers nowadays.
- Text visualization techniques makes it possible for humans to read the most important aspect of that data in a comparatively short period of time.

## Use Case Demonstrations

### Popularity of an N-gram overtime.
<img src="images/bigram.png"  style="width: 1000px;">

### Text Archeology
<img src="images/3grams.png" style="width: 1000px;">

### Relationship between words: Hashtags in tweets as an example
<img src="images/network.png" style="width: 1000px;">

# Arabic Songs Lyrics Dataset

In [26]:
import pandas as pd

In [27]:
arabic_lyrics = pd.read_csv("arabicLyrics.csv")
arabic_lyrics.head()

Unnamed: 0,songID,Singer,SongTitle,SongWriter,Composer,LyricsOrder,Lyrics,SingerNationality,SongDialect
0,1537,ابتسام,اروح لاحبابي,ملامح,بندر بن فهد,2,اروح لاحبابي والاقي الفرح ساكن عينهم,Morocco,Meghribi
1,1537,ابتسام,اروح لاحبابي,ملامح,بندر بن فهد,3,ابتسم لافراحهم وانا من الهم احترق,Morocco,Meghribi
2,1537,ابتسام,اروح لاحبابي,ملامح,بندر بن فهد,4,واسأل جروحي من ترى حس بعذابي منهم,Morocco,Meghribi
3,1537,ابتسام,اروح لاحبابي,ملامح,بندر بن فهد,5,وبالحقيقه انصدم محدن معه همي فرق,Morocco,Meghribi
4,1537,ابتسام,اروح لاحبابي,ملامح,بندر بن فهد,6,دورت في كل الوجيه حسيت غربه بينهم,Morocco,Meghribi


In [28]:
print("Data set Has {} Unique Songs ".format(arabic_lyrics["SongTitle"].nunique()))

Data set Has 25799 Unique Songs 


In [29]:
arabic_lyrics["SingerNationality"].value_counts()

Egypt           139193
Saudi Arabia     87822
Lebanon          78220
Iraq             70640
Sudan            44580
Kuwait           26518
Syria            23580
UAE              19263
Morocco          11846
Tunisia           5611
Yemen             5279
Jordan            4309
Algeria           3074
Qatar             2746
Bahrain           2508
Palestine         1116
Oman              1086
Libya              505
Name: SingerNationality, dtype: int64

### Grouping Verses of a song

In [30]:
all_lyrics_by_singer = arabic_lyrics.groupby(['songID','Singer','SongTitle','SongWriter','Composer','SingerNationality'])['Lyrics'].apply(lambda x: ' '.join(x)).reset_index()
all_lyrics_by_singer.head()

Unnamed: 0,songID,Singer,SongTitle,SongWriter,Composer,SingerNationality,Lyrics
0,1,مي سليم,إحلوّت الأيام,محمد عاطف,رامي جمال,Jordan,من يوم ماجيت على قلبى ناديت بهوالك حسيت فرحة ع...
1,2,هيثم يوسف,إنسي,ضياء الميالي,هيثم يوسف,Iraq,انسى ولا تعذب قلبك صدقني الحب يتعبك ما اريدك ي...
2,3,هيثم يوسف,احباب الروح,خضير هادي -حازم جابر,هيثم يوسف,Iraq,أحبـــاب الروح أحباب الروح جرحوني راحوا لبعيد ...
3,4,عبد المنعم العامري,الاسطورة,عبد الله مانع,طارق المقبل,UAE,ربي خلق في الكون الاف والاف بس بحالاتك ما خلق ...
4,5,الجان محمود عبد العزيز,الحنين,المعز فتح الرحمن,المعز فتح الرحمن,Sudan,انت ما قتلى لي... كلمتنى عيونى عن مشاعر صادقة ...


### Removing the punctuation

In [7]:
import re
import string
from nltk.tokenize import word_tokenize


def remove_punctuation(text):
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

    table = str.maketrans(' ', ' ', punctuation)
    words = [w.translate(table) for w in word_tokenize(text)]

    return ' '.join(words)

all_lyrics_by_singer["Lyrics"] = all_lyrics_by_singer["Lyrics"].apply(lambda x: remove_punctuation(x))

In [60]:
from collections import Counter
from nltk.tokenize import word_tokenize

bag_words = word_tokenize(' '.join(list(all_lyrics_by_singer["Lyrics"])))
bag_words_cnt = Counter(bag_words)

cnt.most_common()[:10]

[('يا', 55524),
 ('من', 50777),
 ('ما', 45702),
 ('في', 40450),
 ('انا', 32892),
 ('و', 30648),
 ('لا', 22067),
 ('على', 22056),
 ('ولا', 19150),
 ('كل', 19097)]

## TF for each tokenizer

Lines : words 
<br>
Column : TF

In [57]:
words_tf = pd.DataFrame.from_dict(bag_words_cnt, orient='index').reset_index()
words_tf.rename(columns={'index':'word', 0:'TF'})

Unnamed: 0,word,TF
0,من,50777
1,يوم,10869
2,ماجيت,71
3,على,22056
4,قلبى,3892
...,...,...
296765,ﻭﺑﻴﻚ,2
296766,ﻃﻮﻳﺖ,2
296767,ﺻﻔﺤﺔ,2
296768,ﻣﺎﺿﻲ,2


In [None]:
bag_words = word_tokenize(' '.join(list(all_lyrics_by_singer["Lyrics"])))
cnt = Counter(bag_words)

cnt.most_common()[:10]

In [None]:
import numpy as np

bag_words = {}
all_tokenizer = word_tokenize(' '.join(list(all_lyrics_by_singer["Lyrics"])))

for index, row in all_lyrics_by_singer[["SongTitle", "Lyrics"]].iterrows():
    lyrics_tokenizer = word_tokenize(row["Lyrics"])
    
    for word in lyrics_tokenizer:
        #occurrences = np.count_nonzero(all_tokenizer==word)
        bag_words[word] = all_tokenizer.count(word)

In [None]:
metrix_term = pd.DataFrame.from_dict(bag_words, orient='index', columns=["TF"])
metrix_term.tail()
# print(len(bag_words))
# # print(row)
# print("------------------------")

In [9]:
all_lyrics_by_song = list(all_lyrics_by_singer["Lyrics"])
songs_title = all_lyrics_by_singer['SongTitle']

In [None]:
metrix_term[metrix_term["TF"] == 0]

In [164]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_lyrics_by_song)

ok = pd.DataFrame(X.toarray().T, index=vectorizer.get_feature_names(), columns=list(songs_title))
ok

Unnamed: 0,اسمع كلامي,افهمني حبيبي,اكتر من سنه,الطفله البريئه,اللي بيني وبينك,أسرارنا,أنا ماشي ساهل,أنا و أنا,إرحمني,ابشرك,...,يايمه حبيته,يحاسبلي,يخليك للى,يعذبني,يلا نفرح,يلي هواك,يوم اقابلك فيه,يوم ما افترقنا,يوم من عمرنا,يوم ورا يوم
100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ai,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
amore,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
anlarda,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ara,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ﻭﻳﻠﻲ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ﻳﺎ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ﻳﺎﺣﺒﻲ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ﻳﺪﻕ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# getting the frequency word in all lyrics

In [165]:
len(all_lyrics_by_song)

709

In [166]:
from collections import Counter
from nltk.tokenize import word_tokenize

bag_words = word_tokenize(str(all_lyrics_by_song))
cnt = Counter(bag_words)

cnt.most_common()[:100]

[('انا', 1266),
 ('ما', 1121),
 ('من', 1059),
 ('في', 971),
 ('يا', 962),
 ('و', 765),
 ("'", 710),
 (',', 708),
 ('ولا', 526),
 ('قلبي', 525),
 ('اللي', 498),
 ('لا', 495),
 ('كل', 471),
 ('مش', 463),
 ('انت', 433),
 ('لو', 418),
 ('على', 416),
 ('حبيبي', 375),
 ('يوم', 335),
 ('فى', 332),
 ('ده', 316),
 ('ايه', 301),
 ('ليه', 291),
 ('كان', 290),
 ('وانا', 289),
 ('عليك', 282),
 ('معاك', 262),
 ('كنت', 255),
 ('الله', 235),
 ('اللى', 224),
 ('غير', 215),
 ('بس', 214),
 ('أنا', 207),
 ('الحب', 196),
 ('هو', 195),
 ('وانت', 194),
 ('ليك', 186),
 ('عمري', 185),
 ('عليا', 185),
 ('ليا', 171),
 ('الدنيا', 169),
 ('كده', 167),
 ('لي', 164),
 ('حبك', 161),
 ('فيك', 154),
 ('قلبى', 150),
 ('حب', 142),
 ('تاني', 141),
 ('لك', 138),
 ('خلاص', 137),
 ('لما', 135),
 ('عليه', 134),
 ('او', 133),
 ('عيني', 130),
 ('بيك', 129),
 ('فيه', 127),
 ('آه', 124),
 ('القلب', 118),
 ('ان', 117),
 ('بعد', 113),
 ('فيها', 111),
 ('عشان', 109),
 ('عن', 109),
 ('قلبك', 108),
 ('دي', 107),
 ('اه', 105),
 ('الناس