### Creating readability scores

To estimate readability, we calculate two measures for each story: the average sentence length (in words) and the average word length (in characters).


**Note: Due to file size issues, we compute readability for only a small selection of the total sample.**

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

# Load the data from the CSV file
data = pd.read_csv("reduced_chapter_text.csv")

# Function to calculate average word length
def avg_word_length(text):
    words = word_tokenize(text)
    word_lengths = [len(word) for word in words]
    return sum(word_lengths) / len(word_lengths) if len(word_lengths) > 0 else 0

# Function to calculate average sentence length
def avg_sentence_length(text):
    sentences = sent_tokenize(text)
    sentence_lengths = [len(sent.split()) for sent in sentences]
    return sum(sentence_lengths) / len(sentence_lengths) if len(sentence_lengths) > 0 else 0

# Calculate average word length and average sentence length for each storyId
averages = data.groupby('storyId').apply(lambda x: pd.Series({
    'avg_word_length': x['text'].apply(avg_word_length).mean(),
    'avg_sentence_length': x['text'].apply(avg_sentence_length).mean()
}))

# Store averages for later use
averages.to_csv("story_averages.csv")

print("Average word and sentence lengths calculated and stored successfully.")


# Calculate average word length and average sentence length for storyId 63, to check functionality
# (a separate variable is used so the per-story `averages` table is not overwritten)
story_63 = data[data['storyId'] == 63]
avg_word_length_63 = story_63['text'].apply(avg_word_length).mean()
avg_sentence_length_63 = story_63['text'].apply(avg_sentence_length).mean()

# Print the results of test
print("Average word length for storyId 63:", avg_word_length_63)
print("Average sentence length for storyId 63:", avg_sentence_length_63)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\readi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Average word and sentence lengths calculated and stored successfully.
Average word length for storyId 63: 3.6958509142053444
Average sentence length for storyId 63: 12.740112994350282
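
Note that `word_tokenize` returns punctuation marks as separate tokens, so the word-length average above includes one-character tokens such as `.` and `,`. The following is a minimal sketch, not the method used for the results in this notebook, of a variant that counts only alphabetic words. It uses a regular expression instead of NLTK so that it runs without downloading any data. The names `WORD_RE`, `alpha_words`, and `avg_alpha_word_length` are hypothetical.

```python
import re

# Hypothetical pattern: runs of letters, optionally joined by internal apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def alpha_words(text):
    """Return alphabetic words only, ignoring punctuation and digits."""
    return WORD_RE.findall(text)

def avg_alpha_word_length(text):
    """Average length of alphabetic words; 0 for text with no words."""
    words = alpha_words(text)
    return sum(len(w) for w in words) / len(words) if words else 0

sample = "Hello, world! It's fine."
print(alpha_words(sample))
print(avg_alpha_word_length(sample))
```

Because punctuation tokens are excluded, this variant would be expected to give somewhat higher averages than the values reported above, so the two are not directly comparable.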


### Finding correlation between readability and popularity

Here, we compute the correlation between our two notions of readability and our four notions of popularity.

In [2]:
# Load the stored averages
averages = pd.read_csv("story_averages.csv")

# Load the popularity metrics
popularity_metrics = pd.read_csv("reduced_project_info.csv")

# Merge the averages and popularity metrics on storyId
merged_data = pd.merge(popularity_metrics, averages, left_on='id', right_on='storyId')

# Calculate correlations
correlations = merged_data[['avg_word_length', 'avg_sentence_length', 'hits', 'kudos', 'comments', 'bookmarks']].corr()

print("Correlation between Average Word Length, Average Sentence Length, and Popularity Metrics:")
print(correlations)

# Define a function to interpret correlation values
def interpret_correlation(correlation):
    if correlation > 0.7:
        return "Strong positive correlation"
    elif correlation > 0.3:
        return "Moderate positive correlation"
    elif correlation > -0.3:
        return "Weak or no correlation"
    elif correlation > -0.7:
        return "Moderate negative correlation"
    else:
        return "Strong negative correlation"

# Iterate over each pair of variables in the correlation matrix and interpret the correlation
for column1 in correlations.columns:
    for column2 in correlations.columns:
        if column1 != column2:
            correlation = correlations.loc[column1, column2]
            interpretation = interpret_correlation(correlation)
            print(f"\nCorrelation between {column1} and {column2}: {interpretation}")


Correlation between Average Word Length, Average Sentence Length, and Popularity Metrics:
                     avg_word_length  avg_sentence_length      hits     kudos  \
avg_word_length             1.000000             0.178947  0.007897  0.004962   
avg_sentence_length         0.178947             1.000000  0.013776  0.009300   
hits                        0.007897             0.013776  1.000000  0.935438   
kudos                       0.004962             0.009300  0.935438  1.000000   
comments                   -0.002383             0.005629  0.818407  0.749074   
bookmarks                   0.008979             0.009061  0.926273  0.943758   

                     comments  bookmarks  
avg_word_length     -0.002383   0.008979  
avg_sentence_length  0.005629   0.009061  
hits                 0.818407   0.926273  
kudos                0.749074   0.943758  
comments             1.000000   0.844439  
bookmarks            0.844439   1.000000  

Correlation between avg_word_length and 

### Result: Readability does not predict popularity

There is weak or no linear correlation between our measures of readability and our measures of popularity: every coefficient between a readability measure and a popularity metric is below 0.02 in absolute value. This indicates that readability, as measured here, does not predict popularity. Somewhat surprisingly, our two measures of readability are also only weakly correlated with each other (0.18). However, the different notions of popularity are strongly positively correlated with one another (0.75 to 0.94).
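
`DataFrame.corr()` computes the Pearson correlation coefficient by default, which measures linear association only. As a minimal sketch of what that number means, the snippet below computes Pearson's r from scratch and then applies it to ranks, giving a rank-based (Spearman-style) coefficient that is less sensitive to a few extremely popular works. The helper names `pearson` and `ranks` and the toy data are illustrative assumptions, not values from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

# Toy data (made up): hits rise with kudos, but one work is an extreme outlier.
kudos = [1, 2, 3, 4, 5]
hits = [10, 20, 30, 40, 5000]

print(pearson(kudos, hits))                # linear correlation on raw values
print(pearson(ranks(kudos), ranks(hits)))  # correlation on ranks
```

On this toy data the rank-based coefficient is higher than the raw one, because the relationship is perfectly monotonic but not linear. Whether the same holds for the real popularity metrics would need to be checked on the data.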

### Finding the most and least readable works

Here we find and print the ID numbers for the work with the longest average sentence length, the shortest average sentence length, the longest average word length, and the shortest average word length.

In [3]:
# Load the stored averages
averages = pd.read_csv("story_averages.csv")

# Find the work with the maximum and minimum average word length
max_avg_word_length = averages.loc[averages['avg_word_length'].idxmax()]
min_avg_word_length = averages.loc[averages['avg_word_length'].idxmin()]

# Find the work with the maximum and minimum average sentence length
max_avg_sentence_length = averages.loc[averages['avg_sentence_length'].idxmax()]
min_avg_sentence_length = averages.loc[averages['avg_sentence_length'].idxmin()]

print("Work with Maximum Average Word Length:")
print(max_avg_word_length)

print("\nWork with Minimum Average Word Length:")
print(min_avg_word_length)

print("\nWork with Maximum Average Sentence Length:")
print(max_avg_sentence_length)

print("\nWork with Minimum Average Sentence Length:")
print(min_avg_sentence_length)


Work with Maximum Average Word Length:
storyId                103753.000000
avg_word_length             5.356742
avg_sentence_length        11.072289
Name: 4144, dtype: float64

Work with Minimum Average Word Length:
storyId                82910.000000
avg_word_length            2.720848
avg_sentence_length        8.631579
Name: 3374, dtype: float64

Work with Maximum Average Sentence Length:
storyId                32287.000000
avg_word_length            3.894379
avg_sentence_length      741.500000
Name: 1293, dtype: float64

Work with Minimum Average Sentence Length:
storyId                41540.000000
avg_word_length            3.281768
avg_sentence_length        4.100000
Name: 1671, dtype: float64


### Tag analysis with NLP
This part of the project analyzes AO3 tags using NLP: we train a classifier to predict a tag's type from the text of its name.

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import accuracy_score
import nltk

data_tags = pd.read_csv('small_story_tags.csv')
data_tags

Unnamed: 0.1,Unnamed: 0,storyId,tagId,Tag Name,Tag Type
0,0,3,1,No Archive Warnings Apply,warning
1,1,3,2,Other,category
2,2,3,3,Viggo Mortensen/Orlando Bloom,relationship
3,3,3,4,Lord of the Rings RPF,fandom
4,4,3,5,Sean Bean,character
...,...,...,...,...,...
4189,4189,11615,1891,Brendon Urie,character
4190,4190,11615,1892,Ryan Ross,character
4191,4191,11615,1893,Jon Walker,character
4192,4192,11615,1894,Zack Hall,character


In [5]:
print(data_tags['Tag Type'].unique(),',',data_tags['Tag Type'].nunique())
print()
print(data_tags['Tag Name'].unique(),',',data_tags['Tag Name'].nunique())



 'Ryan Ross' 'Jon Walker' 'Zack Hall'] , 1892


In [6]:
# Convert every tag name to a string (guards against missing or non-string values)
tag_names = [str(tag) for tag in data_tags['Tag Name'].to_list()]

# Bag-of-words features: one column per token seen in the tag names
CountV = CountVectorizer()
CountV_tag_names = CountV.fit_transform(tag_names)

# Hold out 20% of the tags for testing (no random_state is set, so the split varies between runs)
X_train, X_test, y_train, y_test = train_test_split(CountV_tag_names, data_tags['Tag Type'], train_size=0.8)

# Train a multinomial naive Bayes classifier to predict the tag type from the tag name
tags_classifier = MultinomialNB()
tags_classifier.fit(X_train, y_train)
tags_pred = tags_classifier.predict(X_test)
print('Confusion Matrix:')
print(metrics.confusion_matrix(y_true=y_test, y_pred=tags_pred))
print(accuracy_score(y_test, tags_pred))


Confusion Matrix:
[[ 34   0   0  61   0   0]
 [ 18 125   5  11  36   0]
 [  3   5  88  15   0   2]
 [  7   0   9 218   0   4]
 [  2  66   2   1  13   0]
 [  0   0   0   0   0 114]]
0.7056019070321812


### Result: 
- The multinomial naive Bayes classifier predicts the correct tag type from the tag name with about 71% accuracy (0.706) on the held-out test split. Often, it confuses "category" tags with "character" tags or "fandoms" with "warnings".
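
Overall accuracy hides large differences between classes. As a sketch, the snippet below recomputes the accuracy and the per-class recall directly from the confusion matrix printed above. Rows are true classes and columns are predicted classes, in the sorted label order that scikit-learn uses by default. Because the label names are not printed in the output, classes are referred to by row index here.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
confusion = [
    [34, 0, 0, 61, 0, 0],
    [18, 125, 5, 11, 36, 0],
    [3, 5, 88, 15, 0, 2],
    [7, 0, 9, 218, 0, 4],
    [2, 66, 2, 1, 13, 0],
    [0, 0, 0, 0, 0, 114],
]

# Accuracy is the diagonal (correct predictions) divided by all test items.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total
print(f"accuracy: {accuracy:.4f}")

# Recall for each class: correctly predicted items divided by all items of that true class.
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
for i, recall in enumerate(recalls):
    print(f"class {i}: recall {recall:.3f}")
```

In this run, the class in row 4 is recovered only about 15% of the time, while the class in row 5 is always recovered, so the single accuracy figure is an average over very uneven per-class performance.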

### % Of Restricted Works

To find the % of works that are restricted by their authors and thus not in our sample, we will go through the whole list of works and count the number of restricted ones.

In [11]:
# Load the data from the CSV file
data = pd.read_csv("not_reduced_project_info.csv")

# Calculate the percentage of restricted works
restricted_percentage = (data['restricted'].sum() / len(data)) * 100

print("Percentage of restricted works:", round(restricted_percentage, 2), "%")

Percentage of restricted works: 4.4 %


### Result: Restriction is Uncommon

Less than 5% of works in the full, unreduced list (4.4%) are restricted.