This kernel is just some exploratory analysis on the English version of the dataset.

1.   In the first version, I used the read_and_reformat function from [Sohier Dane](https://www.kaggle.com/sohier)'s kernel. Many thanks to Sohier Dane.

2.   I have added the Chapter or 'Para' column using help from [Wikipedia](https://en.wikipedia.org/wiki/Juz%27).

3.   I am not a huge fan of wordclouds, but I have added one just in case someone wants to explore it further.

4.   A dataframe named 'counts' has been created with a matrix of words and their respective counts. Anyone willing to explore further is welcome to use it.

Here it goes:

****The Dataset****

***Context:***
> The Holy Quran is the central text for 1.5 billion Muslims around the world. It literally means "The Recitation." It is undoubtedly the finest work in Arabic literature and revealed by Allah (God) to His Messenger Prophet Muhammed (Peace Be Upon Him) through angel Gabriel. It was revealed verbally from December 22, 609 (AD) to 632 AD (when Prophet Muhammed (Peace Be Upon Him) died).

> The book is divided into 30 parts, 114 Chapters and 6,000+ verses.

***Structure:***

The dataset contains the English translation of the entire Holy Quran as a CSV file, with each verse as a separate row. The Surah (Chapter) number and the Ayah (Verse) number for each row are also provided.

***Goals of this Kernel:***

At first glance, we can see that the dataset is missing an important categorization parameter of the Quran: the Para (Part) for each verse.
    *Goal I: We will use data from Wikipedia to add this information to the dataset.*

It is common knowledge that the earlier chapters in the Quran are quite lengthy and the later ones relatively shorter.
    *Goal II: We will visualize the structure of chapters and verses to check whether the trend holds.*

A prerequisite for further work on this dataset is a list of all unique words in the entire text, along with the number of times each word occurs.
    *Goal III: We will create a matrix of each unique word and its occurrences across the entire text.*
    
Wordclouds are often a good way to get a quick glance at the most frequently occurring words in a text.
    *Goal IV: We will generate a wordcloud of the most frequently occurring words in the Holy Quran.*


Let's get started!

In [None]:
import pandas as pd

df = pd.read_csv('../input/en.yusufali.csv')
df.info()

The dataframe is missing an important column: Part or Para.

We add this new column using the Para-Surah distribution information, courtesy of Wikipedia:

In [None]:
for col in ['Surah', 'Ayah']:
    df[col] = pd.to_numeric(df[col])

def idx(surah, ayah):
    """Return the row index of the verse identified by (surah, ayah)."""
    mask = (df['Surah'] == surah) & (df['Ayah'] == ayah)
    return int(df.loc[mask].index[0])

# Row index of the last verse of each Para; -1 opens the first bin so that row 0 is included.
cut_points = [-1, idx(2,141), idx(2,252), idx(3,92), idx(4,23), idx(4,147), idx(5,81), idx(6,110), idx(7,87), idx(8,40),
             idx(9,92), idx(11,5), idx(12,52), idx(14,52), idx(16,128), idx(18,74), idx(20,135), idx(22,78), idx(25,20),
             idx(27,55), idx(29,45), idx(33,30), idx(36,27), idx(39,31), idx(41,46), idx(45,37), idx(51,30), idx(57,29),
             idx(66,12), idx(77,50), idx(114,6)]
# Para numbers 1..30, one label per bin.
label_names = list(range(1, len(cut_points)))

if 'Para' not in df.columns:
    # pd.cut bins are right-closed, so each cut point is the last row of its Para.
    df.insert(2, 'Para', pd.cut(df.index, cut_points, labels=label_names))
df['Para'] = df['Para'].astype(int)
df.head()
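
To make the binning step easier to follow, here is a minimal plain-Python sketch of the same idea: each cut point is the last row of a part, and a row belongs to the first part whose cut point is not smaller than its index. The toy boundaries and the helper name `part_for_row` are made up for illustration and are not part of the dataset or of pandas.

```python
from bisect import bisect_left

# Hypothetical toy boundaries: last row index of each part, with -1 opening the first bin.
toy_cut_points = [-1, 2, 5, 6]

def part_for_row(row_index, cut_points):
    """Return the 1-based part whose right-closed bin (low, high] holds row_index."""
    return bisect_left(cut_points, row_index)

# Rows 0-2 fall in part 1, rows 3-5 in part 2, row 6 in part 3.
toy_parts = [part_for_row(i, toy_cut_points) for i in range(7)]
print(toy_parts)
```

This mirrors the right-closed intervals that `pd.cut` uses by default, which is why the first cut point has to be -1 rather than 0.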

With the Para column added as well,
let's take a bird's-eye view of how the number of verses changes across Surahs:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig = plt.figure(figsize=(20,5))
ax = fig.add_subplot(1,1,1)
df.plot.scatter('Surah', 'Ayah', ax=ax)    

It seems that the number of verses per Surah falls in the later Surahs, with the first few Surahs being among the longest.
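
One way to put numbers behind the plot is to take the highest Ayah number in each Surah, which equals its verse count. The sketch below does this on made-up (Surah, Ayah) pairs; `toy_rows` and `verses_per_surah` are hypothetical names. On the real dataframe, `df.groupby('Surah')['Ayah'].max()` should give the equivalent result.

```python
# Hypothetical (Surah, Ayah) pairs standing in for the dataframe's two columns.
toy_rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)]

def verses_per_surah(rows):
    """Map each Surah number to its highest Ayah number, i.e. its verse count."""
    totals = {}
    for surah, ayah in rows:
        totals[surah] = max(ayah, totals.get(surah, 0))
    return totals

toy_totals = verses_per_surah(toy_rows)
print(toy_totals)
```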

Let's now check which words appear most often across the entire text.
This requires stripping punctuation from the Text column of each row and ignoring the difference between uppercase and lowercase words.

In [None]:
# Characters stripped from every token before counting.
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")", "!"]

clean_text = []
for text in df['Text']:
    tokens = []
    for word in text.split():  # split() with no argument also collapses repeated spaces
        word = word.lower()
        for p in punctuation:
            word = word.replace(p, '')
        if word:  # skip tokens that consisted only of punctuation
            tokens.append(word)
    clean_text.append(tokens)

# Store the cleaned verse next to the original text.
df['Clean Text'] = [' '.join(tokens) for tokens in clean_text]
df.head()
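
As an aside, the same cleaning can be sketched more compactly with `str.translate`, which deletes all listed characters in one pass. This is only an alternative sketch: `PUNCTUATION`, `STRIP_TABLE` and `clean_verse` are illustrative names, and the sample sentence is made up rather than taken from the dataset.

```python
# The same punctuation characters as in the list above, collected into one string.
PUNCTUATION = ",:;.'\"’?/-+&()!"
STRIP_TABLE = str.maketrans('', '', PUNCTUATION)

def clean_verse(text):
    """Lowercase a verse, delete punctuation, and return its non-empty tokens."""
    tokens = (word.translate(STRIP_TABLE) for word in text.lower().split())
    return [t for t in tokens if t]

sample_tokens = clean_verse("Hello, World - it's a (toy) sentence!")
print(sample_tokens)
```

Dropping empty tokens matters here because a stand-alone dash becomes an empty string once its punctuation is removed.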

Now that each row of the text is rid of punctuation and extra spaces, let's build a matrix of words and count how often each one occurs:

In [None]:
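import numpy as np

# single_tokens: every token seen at least once.
# unique_tokens: tokens seen at least twice, in order of their second appearance.
# Words that occur only once in the whole text are therefore left out of the matrix.
unique_tokens = []
single_tokens = set()
repeated_tokens = set()  # same content as unique_tokens, kept as a set for fast lookups

for tokens in clean_text:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.add(token)
        elif token not in repeated_tokens:
            repeated_tokens.add(token)
            unique_tokens.append(token)

# One row per verse, one column per token.
counts = pd.DataFrame(0, index=np.arange(len(clean_text)), columns=unique_tokens)
for index, tokens in enumerate(clean_text):
    for token in tokens:
        if token in repeated_tokens:
            # .at writes to the frame directly; chained iloc[...][...] may update a copy.
            counts.at[index, token] += 1

counts.head()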
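
If only the overall totals are needed, a lighter alternative to the verse-by-word matrix is `collections.Counter`. The sketch below runs on made-up token lists; `toy_clean_text` and `count_words` are hypothetical names, and unlike the matrix above this version also keeps words that occur only once.

```python
from collections import Counter

# Hypothetical token lists standing in for `clean_text` (one list per verse).
toy_clean_text = [['a', 'toy', 'verse'], ['another', 'toy', 'verse'], ['a', 'third']]

def count_words(token_lists):
    """Count how often each token occurs across all verses."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return totals

toy_counts = count_words(toy_clean_text)
print(toy_counts.most_common(3))
```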

Again, I am not a fan of wordclouds, but let's put together one for the most frequently occurring words in the entire scripture:

In [None]:
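from wordcloud import WordCloud, STOPWORDS

# Total occurrences of each word across all verses.
word_counts = counts.sum(axis=0)

# Size each word by how often it occurs. Joining the unique words into one string
# would give every word the same weight. generate_from_frequencies() does not
# filter stopwords itself, so the library's default STOPWORDS are dropped here.
frequencies = {word: int(n) for word, n in word_counts.items()
               if word and word not in STOPWORDS}
wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(frequencies)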

fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(1,1,1)
ax.imshow(wordcloud, interpolation="bilinear")
ax.axis("off")


Let's now list the most frequently occurring words, excluding common stopwords such as conjunctions and prepositions:

In [None]:
stopwords = ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves', 'its', 'before', 'herself', 'had',
             'should', 'to', 'only', 'under', 'ours', 'has', 'do', 'them', 'his', 'very', 'they', 'not', 'during',
             'now', 'him', 'nor', 'did', 'this', 'she', 'each', 'further', 'where', 'few', 'because', 'doing',
             'some', 'are', 'our', 'ourselves', 'out', 'what', 'for', 'while', 'does', 'above', 'between', 't',
             'be', 'we', 'who', 'were', 'here', 'hers', 'by', 'on', 'about', 'of', 'against', 's', 'or', 'own',
            'into', 'yourself', 'down', 'your', 'from', 'her', 'their', 'there', 'been', 'whom', 'too',
             'themselves', 'was', 'until', 'more', 'himself', 'that', 'but', 'don', 'with', 'than', 'those',
             'he', 'me', 'myself', 'these', 'up', 'will', 'below', 'can','theirs', 'my', 'and', 'then', 'is',
             'am', 'it', 'an', 'as', 'itself', 'at', 'have', 'in', 'any', 'if', 'again', 'no', 'when', 'same',
             'how', 'other', 'which', 'you', 'after', 'most', 'such', 'why', 'a', 'off', 'i', 'yours', 'so',
             'the', 'having', 'once', 'say', 'thou', 'said', 'shall', 'thee', 'us', 'ye', 'o', 'sent', 'thy',
             'come', 'see', 'made', 'give', 'may', ' ']

# Drop all stopwords in one step; errors='ignore' skips stopwords that are not in the index.
word_counts = word_counts.drop(stopwords, errors='ignore')

word_counts.sort_values(ascending=False).head(20)

Let's look for the number of times some specific words occur in the text:

In [None]:
words = ['allah', 'truth', 'lie','man', 'men', 'woman', 'women', 'heaven', 'hell', 'paradise', 'hellfire', 'good', 'evil']

# .get() returns 0 for a word that is missing from the index instead of raising a KeyError.
word_times = [[w, word_counts.get(w, 0)] for w in words]

times_occurrence = pd.DataFrame(data=word_times, columns=['Word', 'Times'])
times_occurrence.head(20)

This is just a very small introduction to what can be done with this dataset. You are welcome to go further, for example by finding the most used words for each Surah or Para.
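
As a possible starting point for that per-Surah or per-Para analysis, here is a minimal sketch that groups token counts by a key and returns the most frequent words in each group. The toy data and the function name `top_words_by_group` are assumptions for illustration. On the real data, the keys would come from the 'Surah' or 'Para' column and the tokens from the cleaned text.

```python
from collections import Counter, defaultdict

# Hypothetical (group key, tokens) pairs; the key could be a Surah or a Para number.
toy_verses = [
    (1, ['mercy', 'praise']),
    (1, ['mercy', 'guide']),
    (2, ['book', 'guide', 'book']),
]

def top_words_by_group(verses, stopwords=(), n=1):
    """Return the n most frequent non-stopword tokens for each group key."""
    per_group = defaultdict(Counter)
    for key, tokens in verses:
        per_group[key].update(t for t in tokens if t not in stopwords)
    return {key: counter.most_common(n) for key, counter in per_group.items()}

toy_top = top_words_by_group(toy_verses)
print(toy_top)
```

Passing the stopword list from earlier as `stopwords` would keep conjunctions and prepositions out of the per-group rankings.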

Regards,
Maaz Imran