# Using Python for Research Homework: Week 3, Case Study 2

In this case study, we will find and plot the distribution of word frequencies for each translation of Hamlet. Perhaps the distribution of word frequencies depends on the translation; let's find out!

In [1]:
# DO NOT EDIT THIS CODE!
import os
import pandas as pd
import numpy as np
from collections import Counter

def count_words_fast(text):
    text = text.lower()
    skips = [".", ",", ";", ":", "'", '"', "\n", "!", "?", "(", ")"]
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

def word_stats(word_counts):
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

### Exercise 1 

In this case study, we will find and visualize summary statistics of the text of different translations of Hamlet. For this case study, functions `count_words_fast` and `word_stats` are already defined as in the Case 2 Videos (Videos 3.2.x).

#### Instructions 
- Read in the data as a pandas dataframe using `pd.read_csv`. Use the `index_col` argument to set the first column in the csv file as the index for the dataframe. The data can be found at https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@hamlets.csv
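
To see what `index_col` changes without downloading the file, here is a minimal sketch that reads a small in-memory CSV. The CSV content is a hypothetical stand-in for `hamlets.csv` and is not taken from the real data.

```python
import io
import pandas as pd

# Hypothetical stand-in for hamlets.csv: an unnamed first column holds the row labels.
csv_text = ',language,text\n1,English,"To be, or not to be"\n2,German,"Sein oder Nichtsein"\n'

# Without index_col, the first column is read as an ordinary data column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
first_col_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())
print(first_col_index.columns.tolist())
print(first_col_index.index.tolist())
```

With `index_col=0`, only `language` and `text` remain as columns, which matches the layout printed for `hamlets` in this exercise.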

In [2]:
hamlets = pd.read_csv("https://courses.edx.org/asset-v1:HarvardX+PH526x+2T2019+type@asset+block@hamlets.csv",index_col=0)
print(hamlets)

     language                                               text
1     English  The Tragedie of Hamlet\n                      ...
2      German  Hamlet, Prinz von Dännemark.\n                ...
3  Portuguese  HAMLET\n                             DRAMA EM ...


### Exercise 2 

In this exercise, we will summarize the text for a single translation of Hamlet in a `pandas` dataframe. 

#### Instructions
- Find the dictionary of word frequency in `text` by calling `count_words_fast()`. Store this as `counted_text`.
- Create a `pandas` dataframe named `data`.
- Using `counted_text`, define two columns in data:
    - `word`, consisting of each unique word in text.
    - `count`, consisting of the number of times each word in `word` is included in the text.
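
As a small illustration of these steps, the sketch below builds the two columns from a `Counter` made from a toy sentence. The names `toy_counts` and `toy_data` are illustrative only; in the exercise the counts would come from `count_words_fast(text)`.

```python
from collections import Counter
import pandas as pd

# Toy word counts; in the exercise this would come from count_words_fast(text).
toy_counts = Counter("to be or not to be".split(" "))

# A Counter is a dict subclass, so items() yields (word, count) pairs.
toy_data = pd.DataFrame(list(toy_counts.items()), columns=["word", "count"])
print(toy_data)
```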

In [32]:
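# iloc is zero-based, so position 1 selects the second row (the German translation).
language, text = hamlets.iloc[1]
counted_text = count_words_fast(text)
# Build one row per unique word from the Counter's keys and values.
data = pd.DataFrame({
    "word": list(counted_text.keys()),
    "count": list(counted_text.values())
})
print(data)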

                 word  count
0              hamlet    210
1               prinz     16
2                 von    212
3           dännemark     14
4                      67223
...               ...    ...
7476        tradeused      1
7477            sales      1
7478         hardware      1
7479          related      1
7480  permission]*end      1

[7481 rows x 2 columns]
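
Row 4 of this output is an empty word with a count of 67223. A likely cause, assuming the text contains runs of consecutive spaces, is that `count_words_fast` splits on a single space, so each extra space produces an empty string. The sketch below reproduces the effect on a toy string and shows one possible way to filter such rows; whether to drop them depends on what the assignment expects.

```python
from collections import Counter
import pandas as pd

# Splitting on a single space turns each extra consecutive space into an empty string.
tokens = "hamlet  prinz   von".split(" ")
toy_counts = Counter(tokens)
print(toy_counts[""])  # number of empty-string tokens

toy_data = pd.DataFrame(list(toy_counts.items()), columns=["word", "count"])

# Optional clean-up (an assumption, not part of the exercise): drop the empty word.
cleaned = toy_data[toy_data["word"] != ""].reset_index(drop=True)
print(cleaned)
```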


### Exercise 3

In this exercise, we will continue to define summary statistics for a single translation of Hamlet. 

#### Instructions
- Add a column to data named `length`, defined as the length of each word.
- Add another column named `frequency`, which is defined as follows for each word in `data`:
    - If `count > 10`, `frequency` is "frequent".
    - If `1 < count <= 10`, `frequency` is "infrequent".
    - If `count == 1`, `frequency` is "unique".
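
The three rules above partition the counts into right-closed intervals, so they can also be expressed with `pd.cut`. This is only an alternative sketch on toy data (the words and counts are made up), not the method the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Toy data; the counts are invented to hit each category boundary.
toy_data = pd.DataFrame({"word": ["hamlet", "prinz", "geist", "dolch"],
                         "count": [210, 10, 2, 1]})
toy_data["length"] = toy_data["word"].str.len()

# Right-closed bins: (0, 1] -> unique, (1, 10] -> infrequent, (10, inf] -> frequent.
toy_data["frequency"] = pd.cut(
    toy_data["count"],
    bins=[0, 1, 10, np.inf],
    labels=["unique", "infrequent", "frequent"],
)
print(toy_data)
```

Note that `pd.cut` returns a categorical column rather than plain strings, and a count of exactly 10 falls in "infrequent", as the second rule requires.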

In [33]:
data['length'] = data['word'].apply(len)
print(data)


                 word  count  length
0              hamlet    210       6
1               prinz     16       5
2                 von    212       3
3           dännemark     14       9
4                      67223       0
...               ...    ...     ...
7476        tradeused      1       9
7477            sales      1       5
7478         hardware      1       8
7479          related      1       7
7480  permission]*end      1      15

[7481 rows x 3 columns]


In [34]:
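conditions = [
    (data['count'] == 1),
    (data['count'] > 1) & (data['count'] <= 10),
    (data['count'] > 10)
]
values = ['unique', 'infrequent', 'frequent']
# An explicit string default avoids mixing the string choices with np.select's numeric default of 0.
data['frequency'] = np.select(conditions, values, default='unknown')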
print(data)

                 word  count  length frequency
0              hamlet    210       6  frequent
1               prinz     16       5  frequent
2                 von    212       3  frequent
3           dännemark     14       9  frequent
4                      67223       0  frequent
...               ...    ...     ...       ...
7476        tradeused      1       9    unique
7477            sales      1       5    unique
7478         hardware      1       8    unique
7479          related      1       7    unique
7480  permission]*end      1      15    unique

[7481 rows x 4 columns]


In [35]:
# count() tallies the non-missing entries of each column within every frequency group.
Group = data.groupby('frequency').count()
print(Group)

            word  count  length
frequency                      
frequent     303    303     303
infrequent  1596   1596    1596
unique      5582   5582    5582


In [36]:
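# Select the numeric columns explicitly; a mean of the string column `word` is not defined.
Group = data.groupby('frequency')[['count', 'length']].mean()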
print(Group)

                 count    length
frequency                       
frequent    271.590759  4.528053
infrequent    3.383459  6.481830
unique        1.000000  9.006987


### Exercise 4

In this exercise, we will summarize the statistics in data into a smaller pandas dataframe. 

#### Instructions 
- Create a `pandas` dataframe named `sub_data` including the following columns:
    - `language`, which is the language of the text (defined in Exercise 2).
    - `frequency`, which is a list containing the strings "frequent", "infrequent", and "unique".
    - `mean_word_length`, which is the mean word length of each value in frequency.
    - `num_words`, which is the total number of words in each frequency category.
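
One way to get both summary columns in a single pass is pandas named aggregation. The sketch below uses a made-up `toy_data` frame and hard-codes the language as "German" purely for illustration.

```python
import pandas as pd

# Toy per-word table with the columns built in the earlier exercises.
toy_data = pd.DataFrame({
    "word": ["der", "und", "geist", "dolch", "schauspieler"],
    "length": [3, 3, 5, 5, 12],
    "frequency": ["frequent", "frequent", "infrequent", "unique", "unique"],
})

# Named aggregation: each keyword becomes an output column.
toy_summary = (
    toy_data.groupby("frequency")
    .agg(mean_word_length=("length", "mean"), num_words=("word", "count"))
    .reset_index()
)

# Put the language label in the first column; "German" is a placeholder value.
toy_summary.insert(0, "language", "German")
print(toy_summary)
```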

In [21]:
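# Select the numeric columns explicitly; a mean of the string column `word` is not defined.
Group = data.groupby('frequency')[['count', 'length']].mean()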
print(Group)

                 count    length
frequency                       
frequent    203.182663  4.371517
infrequent    3.509015  5.825243
unique        1.000000  7.005675


In [None]:
def summarize_text(language, text):
    counted_text = count_words_fast(text)

    data = pd.DataFrame({
        "word": list(counted_text.keys()),
        "count": list(counted_text.values())
    })
    
    # Assignment order matters: the last line overrides "infrequent" for words that occur once.
    data.loc[data["count"] > 10,  "frequency"] = "frequent"
    data.loc[data["count"] <= 10, "frequency"] = "infrequent"
    data.loc[data["count"] == 1,  "frequency"] = "unique"
    
    data["length"] = data["word"].apply(len)
    
    sub_data = pd.DataFrame({
        "language": language,
        "frequency": ["frequent","infrequent","unique"],
        "mean_word_length": data.groupby(by = "frequency")["length"].mean(),
        "num_words": data.groupby(by = "frequency").size()
    })
    
    return sub_data
    
# write your code here!
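
The plotting cell below reads from a dataframe named `grouped_data`, which is not defined anywhere in this notebook. One way it might be built is to call `summarize_text` on each row of `hamlets` and concatenate the results. The sketch below shows that pattern on toy data; `toy_summarize`, `toy_hamlets`, and `grouped_toy` are hypothetical stand-ins so the example runs on its own.

```python
import pandas as pd

def toy_summarize(language, text):
    # Hypothetical stand-in for summarize_text: one summary row per language.
    words = text.lower().split()
    return pd.DataFrame({
        "language": [language],
        "num_words": [len(set(words))],
    })

# Toy stand-in for the hamlets dataframe.
toy_hamlets = pd.DataFrame(
    {"language": ["English", "German"],
     "text": ["to be or not to be", "sein oder nichtsein"]},
    index=[1, 2],
)

# Summarize each row, then stack the per-language summaries into one dataframe.
summaries = [
    toy_summarize(row["language"], row["text"])
    for _, row in toy_hamlets.iterrows()
]
grouped_toy = pd.concat(summaries, ignore_index=True)
print(grouped_toy)
```

Collecting the pieces in a list and calling `pd.concat` once avoids `DataFrame.append`, which was removed in pandas 2.0.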


In [None]:
import matplotlib.pyplot as plt

colors = {"Portuguese": "green", "English": "blue", "German": "red"}
markers = {"frequent": "o", "infrequent": "s", "unique": "^"}
for i in range(grouped_data.shape[0]):
    row = grouped_data.iloc[i]
    plt.plot(row.mean_word_length, row.num_words,
        marker=markers[row.frequency],
        color = colors[row.language],
        markersize = 10
    )

color_legend = []
marker_legend = []
for color in colors:
    color_legend.append(
        plt.plot([], [],
        color=colors[color],
        marker="o",
        label = color, markersize = 10, linestyle="None")
    )
for marker in markers:
    marker_legend.append(
        plt.plot([], [],
        color="k",
        marker=markers[marker],
        label = marker, markersize = 10, linestyle="None")
    )
plt.legend(numpoints=1, loc = "upper left")

plt.xlabel("Mean Word Length")
plt.ylabel("Number of Words")
plt.show()  # display the plot