# Visualizing Adjectives from Children's Books
by Alicia Tu

## Introduction

For this project, I chose two children's books ---- Alice in Wonderland, and Anne of Green Gables, both accessed from Project Gutenberg. My goal is to visualize the frequency of usage of adjectives in the two children's books. From the visualization, I can see how are they different and similar to each other in the choice of words.

First, I accessed the books in txt from Project Gutenberg, and I used spacy to spot the adjectives in the sentences. With pandas, I created a data frame for each book respectively to organize the data, and I stored the data as csv files to be used in the future (It is not very necessary to store them in this case specifically since I can reference to the data frames I created directly, but it is always helpful to store csv files for future use). My data frame's colomns include: Author, Title, Text, and Adjectives.

The second step is to concat the two data frames I have. Since both books are not very long, I keep all the data in both data frames and combined them into a bigger data frame directly. From there I extracted out only the title and adjectives to create a small data frame to be used in the future steps.

Then I created corpus with scattertext, and visualize how the adjectives are used into a graph. The x-axis is Anne of Green Gables and the y-axis is Alice in Wonderland. All adjectives that has frequency over 5 will be shown onto the graph. It is displayed in html and I saved a version of the graph in html.

## Code

In [159]:
import requests
import pandas as pd
import spacy

In [160]:
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')

['ner', 'parser']

In [189]:
#load txt for alice in wonderland
response = requests.get('https://www.gutenberg.org/cache/epub/11/pg11.txt')
text_1 = response.text

In [190]:
text_1.find('CHAPTER I.\r\nDown the Rabbit-Hole')

1452

In [191]:
text_1.find('THE END')

148811

In [192]:
start = 1452
end = 148811

In [193]:
#remove the introduction and notes
tale = text_1[start:end]

In [210]:
tale_1_paras = tale.split('\r\n\r\n')

In [195]:
author = []
title = []

In [211]:
for para in tale_1_paras:
    author.append('Carroll')
    title.append('Alice')

In [214]:
alice_df = pd.DataFrame(list(zip(author, title, tale_1_paras)), columns=['author', 'title', 'text'])

In [215]:
alice_df.head()

Unnamed: 0,author,title,text
0,Carroll,Alice,CHAPTER I.\r\nDown the Rabbit-Hole
1,Carroll,Alice,\r\nAlice was beginning to get very tired of s...
2,Carroll,Alice,So she was considering in her own mind (as wel...
3,Carroll,Alice,There was nothing so _very_ remarkable in that...
4,Carroll,Alice,"In another moment down went Alice after it, ne..."


In [199]:
def process_text(text):
    # get the adjectives
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]

    adjectives = [token.lemma_.lower() for token in no_punct
                  if token.pos_ == 'ADJ']
    return ' '.join(adjectives)

In [200]:
alice_df['adj'] = alice_df['text'].apply(process_text)

In [201]:
alice_df.head()

Unnamed: 0,author,title,text,adj
0,Carroll,Alice,CHAPTER I.\r\nDown the Rabbit-Hole,
1,Carroll,Alice,\r\nAlice was beginning to get very tired of s...,tired
2,Carroll,Alice,So she was considering in her own mind (as wel...,hot sleepy stupid worth pink close
3,Carroll,Alice,There was nothing so _very_ remarkable in that...,remarkable late natural large
4,Carroll,Alice,"In another moment down went Alice after it, ne...",


In [202]:
def remove_new_lines(text):
    text = text.replace('\n', ' ')
    text = text.replace('\r', ' ')
    return text

In [203]:
alice_df['text'] = alice_df['text'].apply(remove_new_lines)

In [204]:
alice_df.to_csv('AliceInWonderland.csv', index=False)

In [205]:
response = requests.get('https://www.gutenberg.org/cache/epub/45/pg45.txt')
text_2 = response.text

In [206]:
text_2.find('CHAPTER I. Mrs. Rachel Lynde is Surprised')

2882

In [207]:
text_2.find('START: FULL LICENSE')

573841

In [208]:
start = 2882
end: 573841

In [209]:
tale_2 = text_2[start:end]

In [212]:
tale_2_paras = tale_2.split('\r\n\r\n')

In [224]:
author= []
title = []

In [225]:
for para in tale_2_paras:
    author.append('Montgomery')
    title.append('Anne')

In [226]:
anne_df = pd.DataFrame(list(zip(author, title, tale_2_paras)), columns=['author', 'title', 'text'])

In [227]:
anne_df.head()

Unnamed: 0,author,title,text
0,Montgomery,Anne,CHAPTER I. Mrs. Rachel Lynde is Surprised
1,Montgomery,Anne,\r\nMRS. Rachel Lynde lived just where the Avo...
2,Montgomery,Anne,There are plenty of people in Avonlea and out ...
3,Montgomery,Anne,She was sitting there one afternoon in early J...
4,Montgomery,Anne,"And yet here was Matthew Cuthbert, at half-pas..."


In [228]:
anne_df['adj'] = anne_df['text'].apply(process_text)

In [229]:
anne_df.head()

Unnamed: 0,author,title,text,adj
0,Montgomery,Anne,CHAPTER I. Mrs. Rachel Lynde is Surprised,surprised
1,Montgomery,Anne,\r\nMRS. Rachel Lynde lived just where the Avo...,main little hollow old intricate early dark qu...
2,Montgomery,Anne,There are plenty of people in Avonlea and out ...,capable notable strong abundant awed sharp mai...
3,Montgomery,Anne,She was sitting there one afternoon in early J...,early bright bridal white little late big red
4,Montgomery,Anne,"And yet here was Matthew Cuthbert, at half-pas...",half busy hollow white good plain buggy consid...


In [230]:
anne_df['text'] = anne_df['text'].apply(remove_new_lines)

In [231]:
anne_df.to_csv('AnneOfGreenGables.csv', index=False)

In [241]:
%%capture
!pip install scattertext

In [179]:
import scattertext as st
from IPython.core.display import HTML

In [232]:
filename_1 = 'AliceInWonderland.csv'
alice_df = pd.read_csv(filename_1)
filename_2 = 'AnneOfGreenGables.csv'
anne_df = pd.read_csv(filename_2)

In [233]:
print(alice_df.shape)
print(anne_df.shape)

(822, 4)
(525, 4)


In [182]:
alice_df.sample(5)

Unnamed: 0,author,title,text,adj
370,Carroll,Alice,"Here the Dormouse shook itself, and began si...",
236,Carroll,Alice,The Fish-Footman began by producing from under...,great large solemn solemn little
202,Carroll,Alice,This time Alice waited patiently until it chos...,yawned tall short
179,Carroll,Alice,"“Repeat, “_You are old, Father William_,’” sai...",old
33,Carroll,Alice,“Curiouser and curiouser!” cried Alice (she ...,surprised good large good poor little sure abl...


In [234]:
anne_df.sample(5)

Unnamed: 0,author,title,text,adj
46,Montgomery,Anne,"“I don’t understand,” said Matthew helplessly,...",
132,Montgomery,Anne,“_Call_ you Cordelia? Is that your name?”,
498,Montgomery,Anne,Something warm and pleasant welled up in Maril...,warm pleasant thin little unaccustomedness normal
493,Montgomery,Anne,“How can I be vain when I know I’m homely?” pr...,vain pretty pretty sorrowful ugly beautiful
184,Montgomery,Anne,,


In [235]:
df = pd.concat([alice_df, anne_df])

In [245]:
adj_df = df[['title', 'adj']]

In [246]:
adj_df.head()

Unnamed: 0,title,adj
0,Alice,
1,Alice,tired
2,Alice,hot sleepy stupid worth pink close
3,Alice,remarkable late natural large
4,Alice,


In [247]:
#create scattertext corpus
corpus = st.CorpusFromPandas(adj_df, category_col='title', text_col='adj').build()

In [250]:
# transform corpus into html-based visualization with scattertext
html = st.produce_scattertext_explorer(corpus,
                                       category='Alice',  # this sets the y-axis
                                       category_name='Alice', # label y-axis
                                       not_category_name='Anne',  # label x-axis
                                       minimum_term_frequency=5,
                                       width_in_pixels=900)

In [251]:
HTML(html)

In [252]:
file_name = 'frequency.html'
with open(file_name, encoding='utf8', mode='w') as f:
  f.write(html)

## Discussion

In this project, I reviewed a few skills that I've learned in this class. The skills used are: pandas data frames, scattertext visualization, and use of spacy to process the texts.

The visualization I generated showed how is Alice in Wonderland and Anne of Green Gables different and similar to each other. In the graph, the words that are used by both of them are located in the middle. The closer the words are to the a-xis or the y-axis, the words are more specific to the book.

The books are both for children, and the words locating in the middle that are common to both are all nice and pleasant. For example: delightful, beautiful, and young. But for words are on the side, just from the words themselve, it is easy to see the source. For example, for words such as sleep, curious, large, they are very representative adjectives from Alice in Wonderland, and indeed they locate closer to the y-axis. On the contrary, words such as sorry, gray, red, which are representative for Anne of Green Gables, are located closer to the other axis.

From this visualization it is seen what are some popular adjectives for children's books, and they all have a happy theme. For the words that differ from each other, they are special to the theme of the work.