# Analysis Question-2 Readability 

Readability: write a job that computes Gunning Fog Index and Flesch-Kincaid Readability (both reading ease and grade level) of user comments. Then:
Choose a subreddit and plot the distribution of these scores.
Choose two subreddits focused on similar topics but with different views, e.g., /r/apple and /r/android. Use these metrics to compare the populations from both.

Lets start our work with giving definitions of Gunning Fog Index and Flesch-Kincaid Readability:


# What is Gunning Fog Index

The Gunning fog index is a readability test for English writing. The index estimates the years of formal education a person needs to understand the text on the first reading. For instance, a fog index of 12 requires the reading level of a United States high school senior (around 18 years old). The test was developed in 1952 by Robert Gunning, an American businessman who had been involved in newspaper and textbook publishing.

The fog index is commonly used to confirm that text can be read easily by the intended audience. Texts for a wide audience generally need a fog index less than 12. Texts requiring near-universal understanding generally need an index less than 8.


*Complex Word: Count the "complex" words consisting of three or more syllables. Do not include proper nouns, familiar jargon, or compound words. Do not include common suffixes (such as -es, -ed, or -ing) as a syllable;


![alt text](https://i.imgur.com/S1wP7v0.jpg "Logo Title Text 1")

![alt text](https://i.imgur.com/3WshESB.jpg "Logo Title Text 1")



# What is Flesch-Kincaid Readability

The Flesch–Kincaid readability tests are readability tests designed to indicate how difficult a passage in English is to understand. There are two tests, the Flesch Reading Ease, and the Flesch–Kincaid Grade Level. Although they use the same core measures (word length and sentence length), they have different weighting factors.

The results of the two tests correlate approximately inversely: a text with a comparatively high score on the Reading Ease test should have a lower score on the Grade-Level test. Rudolf Flesch devised the Reading Ease evaluation; somewhat later, he and J. Peter Kincaid developed the Grade Level evaluation for the United States Navy.

![alt text](https://i.imgur.com/gMdtd3M.jpg "Logo Title Text 1")

The result is a number that corresponds with a U.S. grade level. The sentence, "The Australian platypus is seemingly a hybrid of a mammal and reptilian creature" is an 11.3 as it has 24 syllables and 13 words. The different weighting factors for words per sentence and syllables per word in each scoring system mean that the two schemes are not directly comparable and cannot be converted. The grade level formula emphasises sentence length over word length. By creating one-word strings with hundreds of random characters, grade levels may be attained that are hundreds of times larger than high school completion in the United States. Due to the formula's construction, the score does not have an upper bound.

The lowest grade level score in theory is −3.40, but there are few real passages in which every sentence consists of a single one-syllable word. Green Eggs and Ham by Dr. Seuss comes close, averaging 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50 used words are monosyllabic; "anywhere", which occurs eight times, is the only exception.)


# Subreddits we choose

We queried these subreddits for our analysis: Soccer and NFL(The National Football League) We decided to get comments from all years for analyzing the data. We used sampled data as you can see in the Implementation part. (So we took the %1 of the total comments in our HDFS.)  



![alt text](https://i.imgur.com/B16BVBX.jpg "Logo Title Text 1")

# Results

As seen the below distribution chart, Soccer Author's grade level is slightly higher than the NFL Author's grade level. On the other hand, similarity between two sport authors grade level is undeniable. Both Soccer and NFL author grade level is mostly fit between Grade 12 and 15. (High school senior to College sophomore) 
Highest grade levels (College junior-College graduate) in Soccer has a proportion of 7.2%. On the other hand Highest grade levels in NFL is 5.1%. 
Lowest grade levels (Sixth grade-Eighth grade) in NFL has a proportion of 28.8%. Lowest grade levels in Soccer is 24.0%. 

As a result, Soccer authors seems smarter than NFL authors. So that makes us proud as a soccer lovers :) 

![alt text](https://i.imgur.com/H3LGzvh.jpg "Logo Title Text 1")

So what about Readability Levels?(Flesch–Kincaid readability tests) 

In this competition again soccer is slighlty head of its rival NFL, again. Most of the comments in Soccer is "Fairly Easy" with proportion of 24.9%.  Most of the comments in NFL is "Easy" to read with a proportion of 27.1%. Interesting result in this analysis is in both sports "Fairly Easy" proportions are same with 24.9%. 

Now lets talk about the extreme values. The comments that are "Very easy" to read so even an average 11-year-old student can understand easily proportion is 14.8% in NFL. This value is 10.5% in soccer. So what about "Very Diffult ro read" comments which are best understood by university graduates? Soccer comments has proportion of 2.7% in this aristocratic competition. On the other hand, NFL has only 1.8% grad level complexity in all comments. 


![alt text](https://i.imgur.com/eyXQBYj.jpg "Logo Title Text 1")

# Our Implementation

In [1]:
originDF = sqlContext.read.json("hdfs://orion11:11001/sampled_reddit/*")
df = originDF.sample(False, .1)
df.createGlobalTempView("Comments")

In [3]:
import re
def syllable_count(word):
    word = word.lower()
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count += 1
    return count

def pre_process(text):
    # lowercase
    text=text.lower()
    
    # remove special characters and digits
    text=re.sub("(\\d|[^\\w|\\s]|(\_))+","",text)
    text=re.sub("(\\s)+"," ",text)
    #print(text)
    return text.strip()


In [4]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

samp = df.sample(False, .1)
#newsamp = samp.filter(df.subreddit.like("soccer")&~samp['author'].isin(['[deleted]'])).limit(5)
newsamp = samp.filter(df.subreddit.like("soccer")&~samp['author'].isin(['[deleted]']))
#newsamp.show(5)

[nltk_data] Downloading package punkt to /home4/saozdamar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
f=open("soccer_kincaidIndex.txt", "a+")
g=open("soccer_fogIndex.txt", "a+")

iteratebody = newsamp.select("body").rdd.flatMap(list).collect()
for i, row in enumerate(iteratebody):    
    complex_words=0
    #print(i, ".", row)
    number_of_sentences = sent_tokenize(row)
    #print("Number Of Sentences=",len(number_of_sentences))
    #print('Total words=   ', len(row.split()))        
    stops = row.count('.')
    stops = stops+row.count('?')
    stops = stops+row.count('!')    
    #print('total stops=    ', stops);
    row = pre_process(row)
    # Cleaning text and lower casing all words
    for char in '-.,\n':
        row=row.replace(char,' ')
        row = row.lower()
        # split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
        word_list = row.split()        
    total_syllables=0;    
    for i, word in enumerate(word_list):
    #    print(i, ".", word,",syllableCount=",syllable_count(word))        
        total_syllables = total_syllables+syllable_count(word)
        if(syllable_count(word)>=3):
            complex_words=complex_words+1            
        
    total_words=len(row.split())
    #print("total_syllables:",total_syllables)
    #print("complex_words:",complex_words)    
    #print("total_sentences:",len(number_of_sentences))
    total_sentences = len(number_of_sentences)
    
    
    if(total_sentences!=0 and total_words!=0):
        kincaidIndex = 206.835 - (1.015*(total_words/total_sentences)) - 84.6*(total_syllables/total_words)
        #print("(total_words/total_sentences):",(total_words/total_sentences))
        #print("(total_syllables/total_words):",(total_syllables/total_words))
        #print("(complex_words/words):",(complex_words/total_words))
        #print("total_words:" , total_words);
        #print("KincaidIndex:",kincaidIndex)    
        gunningIndex = 0.4* ((total_words/total_sentences)) + 100*(complex_words/total_words)
        #print("FogIndex:",gunningIndex)       
        
        if(gunningIndex>=6 and gunningIndex<=17):
            g.write("%d\r\n" % (gunningIndex))    
        
        if(kincaidIndex>=0 and kincaidIndex<=100):
            f.write("%d\r\n" % (kincaidIndex))        
            
f.close();
g.close();

In [7]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

samp = df.sample(False, .1)
#newsamp = samp.filter(df.subreddit.like("soccer")&~samp['author'].isin(['[deleted]'])).limit(5)
newsamp = samp.filter(df.subreddit.like("nfl")&~samp['author'].isin(['[deleted]']))
#newsamp.show(5)

[nltk_data] Downloading package punkt to /home4/saozdamar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [8]:
f=open("nfl_kincaidIndex.txt", "a+")
g=open("nfl_fogIndex.txt", "a+")

iteratebody = newsamp.select("body").rdd.flatMap(list).collect()
for i, row in enumerate(iteratebody):    
    complex_words=0
    #print(i, ".", row)
    number_of_sentences = sent_tokenize(row)
    #print("Number Of Sentences=",len(number_of_sentences))
    #print('Total words=   ', len(row.split()))        
    stops = row.count('.')
    stops = stops+row.count('?')
    stops = stops+row.count('!')    
    #print('total stops=    ', stops);
    row = pre_process(row)
    # Cleaning text and lower casing all words
    for char in '-.,\n':
        row=row.replace(char,' ')
        row = row.lower()
        # split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
        word_list = row.split()        
    total_syllables=0;    
    for i, word in enumerate(word_list):
    #    print(i, ".", word,",syllableCount=",syllable_count(word))        
        total_syllables = total_syllables+syllable_count(word)
        if(syllable_count(word)>=3):
            complex_words=complex_words+1            
        
    total_words=len(row.split())
    #print("total_syllables:",total_syllables)
    #print("complex_words:",complex_words)    
    #print("total_sentences:",len(number_of_sentences))
    total_sentences = len(number_of_sentences)
    
    
    if(total_sentences!=0 and total_words!=0):
        kincaidIndex = 206.835 - (1.015*(total_words/total_sentences)) - 84.6*(total_syllables/total_words)
        #print("(total_words/total_sentences):",(total_words/total_sentences))
        #print("(total_syllables/total_words):",(total_syllables/total_words))
        #print("(complex_words/words):",(complex_words/total_words))
        #print("total_words:" , total_words);
        #print("KincaidIndex:",kincaidIndex)    
        gunningIndex = 0.4* ((total_words/total_sentences)) + 100*(complex_words/total_words)
        #print("FogIndex:",gunningIndex)       
        
        if(gunningIndex>=6 and gunningIndex<=17):
            g.write("%d\r\n" % (gunningIndex))    
        
        if(kincaidIndex>=0 and kincaidIndex<=100):
            f.write("%d\r\n" % (kincaidIndex))        
            
f.close();
g.close();

# Notes
Output result files can be found in our Github repo.

# References

Wikipedia : https://en.wikipedia.org/wiki/Gunning_fog_index

Wikipedia : https://en.wikipedia.org/wiki/Flesch–Kincaid_readability_tests