# Character Generation

The first part of the notebook calls the functions that generate characters; their characteristics are analyzed in the second part.

There are currently two generation scripts, each with two prompts: one for positively connoted characters and one for negatively connoted characters.

If the thesis requires it, a third function will be added that takes a set of characteristics as input and returns the connotation (positive/negative).

In [8]:
import ollama  # library for running Llama models locally
import csv
import random
import json
import os

In [2]:
from functions import generate_characters_no_author
generate_characters_no_author(iterations=1)  # more than 500 characters have already been generated, so this call is kept only as a placeholder

In [3]:
from functions import generate_characters
generate_characters(author=None, iterations=1)

# Character Analysis

## No Author Analysis

In [11]:
import pandas as pd
import scipy.stats as stats
from scipy.stats import entropy
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr
import numpy as np

In [6]:
from functions import process_character_data
process_character_data('characters_no_writer.json', keys_to_extract=None, writer="")

ethnicity data saved to csv\ethnicity_data.csv
Processed ethnicity data saved to csv\processed_ethnicity_data.csv
moral description data saved to csv\moraldescription_data.csv
Processed moral description data saved to csv\processed_moraldescription_data.csv
physical description data saved to csv\physicaldescription_data.csv
Processed physical description data saved to csv\processed_physicaldescription_data.csv
religion data saved to csv\religion_data.csv
Processed religion data saved to csv\processed_religion_data.csv
sex data saved to csv\sex_data.csv
Processed sex data saved to csv\processed_sex_data.csv


### Ethnicity

In [16]:
from functions import process_ethnicity_data
df = process_ethnicity_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_ethnicity_data.csv')
df

Unnamed: 0,ethnicity,connotation,count,relative_frequency
42,japanese,negative,99,0.197605
10,caucasian,negative,77,0.153693
80,russian,negative,45,0.08982
78,romanian,negative,39,0.077844
0,arab,negative,30,0.05988
38,irish,negative,30,0.05988
55,kurdish,negative,30,0.05988
12,celtic,negative,21,0.041916
28,gypsy,negative,11,0.021956
73,polish,negative,11,0.021956


Japanese is the most frequent ethnicity in both the negative (99 instances, 19.76%) and the positive (181 instances, 36.2%) categories, so it dominates the sample regardless of connotation. Other ethnicities such as Caucasian, Russian, and Punjabi also appear under both connotations, with varying frequencies. The share of Japanese characters is unexpectedly high compared with every other group.

In [17]:
from functions import chi_square_test
result = chi_square_test(df, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(453.0931),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 22}

The chi-square test indicates a statistically significant association between ethnicity and connotation (positive/negative):

Chi-Square Statistic = 453.0931 \
Degrees of Freedom = 22 \
p-value = 0.0 (presumably rounded in the output; effectively zero)

Since the p-value is far below the conventional 0.05 significance threshold, we reject the null hypothesis that connotation is independent of ethnicity: positive and negative connotations are not spread across ethnicities in the same proportions. This suggests bias in how different ethnicities are associated with positively or negatively connoted characters in the dataset. The standardized residuals computed next identify which ethnicities deviate most from their expected counts.
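
As a rough illustration of the test described above, the sketch below runs a chi-square test of independence with `scipy.stats.chi2_contingency`. It is not the `chi_square_test` helper from `functions`, whose implementation is not shown in this notebook; the name `chi_square_sketch` and the pivot step are assumptions. Because the ethnicity table is truncated in the output above, the example uses the complete sex counts from the No Author analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi_square_sketch(df, index_col):
    """Chi-square test of independence between index_col and connotation.

    Hypothetical stand-in for functions.chi_square_test (an assumption).
    """
    # Long format -> contingency table: one row per category,
    # one column per connotation.
    table = df.pivot_table(index=index_col, columns="connotation",
                           values="count", aggfunc="sum", fill_value=0)
    # For 2x2 tables SciPy applies Yates' continuity correction by default.
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return {"Chi-Square Statistic": chi2,
            "p-value": p_value,
            "Degrees of Freedom": dof}


# Sex counts from the No Author analysis in this notebook.
sex_counts = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "connotation": ["negative", "negative", "positive", "positive"],
    "count": [497, 4, 297, 203],
})
sex_result = chi_square_sketch(sex_counts, index_col="sex")
print(sex_result)
```

On these counts the statistic should come out close to the 405.91 reported in the Sex section, which hints that the helper also relies on SciPy's defaults, although that remains an assumption.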

In [18]:
from functions import compute_standardized_residuals
result = compute_standardized_residuals(df, index_col='ethnicity')
result

connotation,negative,positive
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
arab,4.097051,-3.945341
caucasian,6.563812,-6.320759
celtic,3.427839,-3.300909
greek,-2.686474,2.586996
gypsy,2.480888,-2.389023
indian,-6.653205,6.406843
irish,4.097051,-3.945341
japanese,-3.077481,2.963525
polish,2.480888,-2.389023
punjabi,-2.872703,2.766329


These standardized residuals from the chi-square test indicate which ethnicities deviate significantly from expected distributions of positive and negative connotations. Key observations:

- Overrepresentation of negativity (positive residuals in the "negative" column, negative residuals in the "positive" column): 
    - Caucasian (6.56, -6.32), Romanian (4.67, -4.50), Irish (4.10, -3.95), Arab (4.10, -3.95), Gypsy (2.48, -2.39), Polish (2.48, -2.39), Celtic (3.43, -3.30) \
      These groups are disproportionately associated with negative connotations.

- Overrepresentation of positivity (negative residuals in the "negative" column, positive residuals in the "positive" column):
    - Indian (-6.65, 6.41), Sikh (-3.98, 3.84), Greek (-2.69, 2.59), Punjabi (-2.87, 2.77), Somali (-2.08, 2.00) \
      These groups are disproportionately associated with positive connotations.

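- Japanese (-3.08, 2.96) is the most frequent ethnicity under both connotations, yet its residuals show a significant skew toward positive characterizations, so it belongs with the positively skewed groups and is not a balanced case.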

Interpretation: \
These deviations suggest bias in sentiment distribution, with some ethnicities systematically skewed toward positive or negative characterizations. \
The strongest disparities appear for Caucasian (negative) and Indian (positive), indicating areas where bias might be particularly pronounced. 
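
The residuals discussed above can be sketched as Pearson residuals, (observed - expected) / sqrt(expected), computed for each cell of the contingency table. This is an assumption about what `compute_standardized_residuals` does, since its code is not shown here; `pearson_residuals_sketch` and the cut-off of 2 are hypothetical. The example again uses the complete sex counts from the No Author analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats.contingency import expected_freq


def pearson_residuals_sketch(df, index_col):
    """Pearson residuals per cell of the category x connotation table.

    Hypothetical stand-in for functions.compute_standardized_residuals.
    """
    table = df.pivot_table(index=index_col, columns="connotation",
                           values="count", aggfunc="sum", fill_value=0)
    # Expected counts under independence: row total * column total / grand total.
    expected = expected_freq(table.to_numpy())
    return (table - expected) / np.sqrt(expected)


sex_counts = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "connotation": ["negative", "negative", "positive", "positive"],
    "count": [497, 4, 297, 203],
})
residuals = pearson_residuals_sketch(sex_counts, index_col="sex")

# Keep only rows with at least one large residual (threshold is an assumption).
notable = residuals[(residuals.abs() > 2).any(axis=1)]
print(notable)
```

On the sex counts this should give values close to the residuals printed in the Sex section (about -11.95 and 11.96 for female, 7.83 and -7.84 for male). Filtering on a threshold would also explain why some residual tables in this notebook list only a few rows or none at all.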

In [19]:
from functions import compute_divergences
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='ethnicity')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.709721470703967)},
 'KL': {'KL(negative || positive)': np.float64(11.236501846761943),
  'KL(positive || negative)': np.float64(8.19796023576817)}}

These measures quantify the **divergence** between the positive and negative connotation distributions across ethnicities:

#### **Jensen-Shannon Divergence (JSD)**
- **JSD(negative || positive) = 0.7097**  
  - JSD is a symmetric measure of how different two distributions are. With base-2 logarithms it ranges from 0 (identical distributions) to 1 (no overlap).
  - `scipy.spatial.distance.jensenshannon`, imported above, returns the Jensen-Shannon *distance*, i.e. the square root of the divergence; if `compute_divergences` relies on it, the reported figure is on that scale.
  - A value of **0.7097** indicates a **high degree of divergence**: positive and negative connotations are distributed across ethnicities in **substantially different** ways.

#### **Kullback-Leibler Divergence (KL)**
- **KL(negative || positive) = 11.2365**  
- **KL(positive || negative) = 8.1980**  
  - KL divergence is asymmetric: KL(P || Q) measures how poorly Q approximates P, and it grows large wherever P puts mass on categories that are rare in Q.
  - **KL(negative || positive) > KL(positive || negative)** therefore means that the negative distribution puts more of its mass on ethnicities that are rare or absent among positive characters than the other way round. This is consistent with negative connotations being concentrated on ethnicities that seldom receive positive ones.

#### **Interpretation**
- The high **JSD** and **KL** values confirm that the two connotations draw on **substantially different ethnicity distributions**, reinforcing the bias seen in the chi-square test.
- The asymmetry in **KL divergence** suggests that the **negative connotations are more skewed** toward ethnicities that rarely appear positively than the reverse.
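
The two measures above can be sketched with the SciPy functions this notebook already imports. This is a minimal illustration and not the `compute_divergences` helper, whose code is not shown here; the choice of base 2 for the Jensen-Shannon value and natural logarithms for KL is an assumption. The example uses the sex distributions from the No Author analysis, because both categories have non-zero probability under each connotation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

# Sex distributions from the No Author analysis, ordered [female, male].
p_negative = np.array([4, 497]) / 501
p_positive = np.array([297, 203]) / 500

# KL divergence in nats: sum of the elementwise terms p * log(p / q).
kl_neg_pos = rel_entr(p_negative, p_positive).sum()
kl_pos_neg = rel_entr(p_positive, p_negative).sum()

# SciPy returns the Jensen-Shannon distance, the square root of the divergence.
js_distance = jensenshannon(p_negative, p_positive, base=2)
js_divergence = js_distance ** 2

print(kl_neg_pos, kl_pos_neg, js_distance, js_divergence)
```

On these inputs the KL values and `js_distance` should land close to the 0.85, 2.20 and 0.60 printed in the Sex section. A category with zero probability in the second distribution makes KL infinite, so the finite values obtained for ethnicity imply some form of smoothing inside `compute_divergences`; that step is not reproduced in this sketch.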

### Sex

In [21]:
from functions import process_sex_data
df = process_sex_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_sex_data.csv')
df

Unnamed: 0,sex,connotation,count,relative_frequency
2,male,negative,497,0.992016
0,female,negative,4,0.007984
1,female,positive,297,0.594
3,male,positive,203,0.406


In [22]:
result = chi_square_test(df, index_col='sex')
result

{'Chi-Square Statistic': np.float64(405.9099),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 1}

In [23]:
result = compute_standardized_residuals(df, index_col='sex')
result

connotation,negative,positive
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,-11.948077,11.96002
male,7.834878,-7.842709


In [24]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='sex')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.6014721467174361)},
 'KL': {'KL(negative || positive)': np.float64(0.8518465487854805),
  'KL(positive || negative)': np.float64(2.197090119501996)}}

### Religion

In [25]:
from functions import process_religion_data
df = process_religion_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_religion_data.csv')
df

Unnamed: 0,religion,connotation,count,relative_frequency
3,atheism,negative,117,0.253247
38,shintoism,negative,81,0.175325
21,islamism,negative,57,0.123377
28,orthodoxism,negative,45,0.097403
10,christianity,negative,36,0.077922
8,catholicism,negative,29,0.062771
5,atheist,negative,21,0.045455
30,paganism,negative,19,0.041126
14,hinduism,negative,16,0.034632
43,sunniislam,negative,9,0.019481


In [26]:
result = chi_square_test(df, index_col='religion')
result

{'Chi-Square Statistic': np.float64(394.2351),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 14}

In [27]:
result = compute_standardized_residuals(df, index_col='religion')
result

connotation,negative,positive
religion,Unnamed: 1_level_1,Unnamed: 2_level_1
atheism,8.028262,-7.771702
atheist,3.401248,-3.292554
buddhism,-3.430985,3.321341
catholicism,2.332883,-2.258331
christianity,2.914093,-2.820967
hinduism,-6.811596,6.593917
islamism,2.476394,-2.397255
paganism,3.235232,-3.131843
shintoism,-2.669166,2.583867
sikhism,-4.230762,4.095559


In [28]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='religion')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.6219273103270238)},
 'KL': {'KL(negative || positive)': np.float64(8.318515404850475),
  'KL(positive || negative)': np.float64(2.6343018650557717)}}

### Physical

In [29]:
from functions import process_phy_data
df = process_phy_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_physicaldescription_data.csv')
df

Unnamed: 0,physicaldescription,connotation,count,relative_frequency
405,scrawny,negative,241,0.096671
332,pale,negative,232,0.093061
504,tall,negative,213,0.085439
266,lean,negative,169,0.067790
562,weathered,negative,105,0.042118
...,...,...,...,...
520,thin,positive,14,0.005632
144,eyes,positive,13,0.005229
235,highcheekbones,positive,13,0.005229
345,piercedeyes,positive,13,0.005229


In [30]:
result = chi_square_test(df, index_col='physicaldescription')
result

{'Chi-Square Statistic': np.float64(2652.3951),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 68}

In [31]:
result = compute_standardized_residuals(df, index_col='physicaldescription')
result

connotation,negative,positive
physicaldescription,Unnamed: 1_level_1,Unnamed: 2_level_1
balding,4.675393,-4.685399
bearded,3.152149,-3.158896
bony,3.229992,-3.236905
bright,-2.918591,2.924837
bright-eyed,-2.741539,2.747406
...,...,...
weathered,7.222482,-7.237939
wizened,2.541344,-2.546783
worn,3.306002,-3.313078
young,-5.390918,5.402456


In [32]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='physicaldescription')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.8358405401706462)},
 'KL': {'KL(negative || positive)': np.float64(11.924524392232907),
  'KL(positive || negative)': np.float64(14.029306987763642)}}

### Moral

In [34]:
from functions import process_mor_data
df = process_mor_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_moraldescription_data.csv')
df

Unnamed: 0,moraldescription,connotation,count,relative_frequency
117,manipulative,negative,344,0.1376
31,cruel,negative,335,0.134
151,ruthless,negative,171,0.0684
155,selfish,negative,155,0.062
32,cunning,negative,146,0.0584
135,power-hungry,negative,139,0.0556
1,ambitious,negative,101,0.0404
36,deceitful,negative,95,0.038
141,reckless,negative,72,0.0288
181,vindictive,negative,56,0.0224


In [35]:
result = chi_square_test(df, index_col='moraldescription')
result

{'Chi-Square Statistic': np.float64(4521.4541),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 56}

In [36]:
result = compute_standardized_residuals(df, index_col='moraldescription')
result

connotation,negative,positive
moraldescription,Unnamed: 1_level_1,Unnamed: 2_level_1
ambitious,4.39447,-4.293866
amoral,4.629603,-4.523616
arrogant,2.63928,-2.578858
authentic,-5.458367,5.333407
avaricious,3.190733,-3.117687
brave,-3.563563,3.481982
callous,4.392027,-4.291479
cold,4.392027,-4.291479
cold-hearted,3.354466,-3.277671
compassionate,-15.135062,14.788573


In [37]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='moraldescription')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.988317402502221)},
 'KL': {'KL(negative || positive)': np.float64(19.200290844548718),
  'KL(positive || negative)': np.float64(19.87708117457782)}}

## Author Analysis - Hemingway Tests

In [38]:
process_character_data('characters_output.json', keys_to_extract=None, writer="Ernest Hemingway")

ethnicity data saved to csv\ethnicity_data_hemingway.csv
Processed ethnicity data saved to csv\processed_ethnicity_data_hemingway.csv
moral description data saved to csv\moraldescription_data_hemingway.csv
Processed moral description data saved to csv\processed_moraldescription_data_hemingway.csv
physical description data saved to csv\physicaldescription_data_hemingway.csv
Processed physical description data saved to csv\processed_physicaldescription_data_hemingway.csv
religion data saved to csv\religion_data_hemingway.csv
Processed religion data saved to csv\processed_religion_data_hemingway.csv
sex data saved to csv\sex_data_hemingway.csv
Processed sex data saved to csv\processed_sex_data_hemingway.csv


### Ethnicity

In [39]:
df = process_ethnicity_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_ethnicity_data_hemingway.csv')
df

Unnamed: 0,ethnicity,connotation,count,relative_frequency
50,mexican,negative,144,0.360902
31,japanese,negative,68,0.170426
63,russian,negative,50,0.125313
52,mexican-american,negative,11,0.027569
71,spanish,negative,10,0.025063
28,irish,negative,9,0.022556
22,hispanic,negative,8,0.02005
25,indian,negative,8,0.02005
41,latinoamerican,negative,8,0.02005
15,cuban,negative,7,0.017544


In [40]:
result = chi_square_test(df, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(60.1201),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 17}

In [41]:
result = compute_standardized_residuals(df, index_col='ethnicity')
result

connotation,negative,positive
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
russian,2.497306,-3.415261


In [42]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='ethnicity')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.31554838434512783)},
 'KL': {'KL(negative || positive)': np.float64(0.48617537425936325),
  'KL(positive || negative)': np.float64(1.1607051948223843)}}

### Sex

In [43]:
df = process_sex_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_sex_data_hemingway.csv')
df

Unnamed: 0,sex,connotation,count,relative_frequency
2,male,negative,398,0.997494
0,female,negative,1,0.002506
3,male,positive,200,0.985222
1,female,positive,3,0.014778


In [44]:
result = chi_square_test(df, index_col='sex')
result

{'Chi-Square Statistic': np.float64(1.4922),
 'p-value': np.float64(0.2219),
 'Degrees of Freedom': 1}

In [45]:
result = compute_standardized_residuals(df, index_col='sex')
result

connotation,negative,positive
sex,Unnamed: 1_level_1,Unnamed: 2_level_1


In [46]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='sex')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.05923379207262645)},
 'KL': {'KL(negative || positive)': np.float64(0.007901138413626807),
  'KL(positive || negative)': np.float64(0.01402592611765316)}}

### Religion

In [47]:
df = process_religion_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_religion_data_hemingway.csv')
df

Unnamed: 0,religion,connotation,count,relative_frequency
7,catholicism,negative,151,0.467492
2,atheism,negative,67,0.20743
26,shintoism,negative,36,0.111455
9,christianity,negative,15,0.04644
19,orthodoxism,negative,15,0.04644
4,atheist,negative,11,0.034056
16,islamism,negative,8,0.024768
22,romancatholicism,negative,5,0.01548
8,catholicism,positive,104,0.541667
27,shintoism,positive,56,0.291667


In [48]:
result = chi_square_test(df, index_col='religion')
result

{'Chi-Square Statistic': np.float64(70.4113),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 10}

In [49]:
result = compute_standardized_residuals(df, index_col='religion')
result

connotation,negative,positive
religion,Unnamed: 1_level_1,Unnamed: 2_level_1
atheism,2.993287,-3.831288
shintoism,-2.795446,3.57806


In [50]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='religion')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.3563017388548288)},
 'KL': {'KL(negative || positive)': np.float64(1.0078512932526835),
  'KL(positive || negative)': np.float64(0.9312442201903928)}}

### Physical

In [51]:
df = process_phy_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_physicaldescription_data_hemingway.csv')
df

Unnamed: 0,physicaldescription,connotation,count,relative_frequency
405,weathered,negative,209,0.104762
365,tall,negative,199,0.099749
303,scrawny,negative,139,0.069674
205,lean,negative,138,0.069173
282,scarred,negative,88,0.04411
294,scars,negative,71,0.035589
62,crookednose,negative,63,0.031579
364,sunkeneyes,negative,63,0.031579
327,strong,negative,56,0.02807
377,tattooed,negative,33,0.016541


In [52]:
result = chi_square_test(df, index_col='physicaldescription')
result

{'Chi-Square Statistic': np.float64(607.6384),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 41}

In [54]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='physicaldescription')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.530114657843018)},
 'KL': {'KL(negative || positive)': np.float64(6.659136311885535),
  'KL(positive || negative)': np.float64(2.3907304977845696)}}

### Moral

In [55]:
df = process_mor_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\processed_moraldescription_data_hemingway.csv')
df

Unnamed: 0,moraldescription,connotation,count,relative_frequency
236,selfish,negative,127,0.063659
37,cruel,negative,123,0.061654
171,manipulative,negative,94,0.047118
136,independent,negative,87,0.043609
112,haunted,negative,76,0.038095
...,...,...,...,...
234,self-reliant,positive,7,0.006897
237,selfish,positive,7,0.006897
244,sincere,positive,7,0.006897
135,impulsive,positive,6,0.005911


In [56]:
result = chi_square_test(df, index_col='moraldescription')
result

{'Chi-Square Statistic': np.float64(1241.3822),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 59}

In [57]:
result = compute_standardized_residuals(df, index_col='moraldescription')
result

connotation,negative,positive
moraldescription,Unnamed: 1_level_1,Unnamed: 2_level_1
amoral,2.306955,-3.101302
brave,-4.01178,5.393144
cold,2.129223,-2.862372
courageous,-2.892936,3.889052
cowardly,2.26383,-3.043328
cruel,4.923906,-6.61934
cynical,2.736837,-3.679205
deceitful,2.390871,-3.214113
disciplined,-3.497389,4.701634
drysenseofhumor,-2.122834,2.853784


In [58]:
connotations = df["connotation"].unique()
result = compute_divergences(df, connotations, index_col='moraldescription')
result

{'JSD': {'JSD(negative || positive)': np.float64(0.7128166249416509)},
 'KL': {'KL(negative || positive)': np.float64(10.744262781031955),
  'KL(positive || negative)': np.float64(4.553780662546629)}}

## Author vs No Author

In [59]:
process_character_data('characters_output.json', keys_to_extract=None, writer="Ernest Hemingway")
process_character_data('characters_no_writer.json', keys_to_extract=None, writer="")

ethnicity data saved to csv\ethnicity_data_hemingway.csv
Processed ethnicity data saved to csv\processed_ethnicity_data_hemingway.csv
moral description data saved to csv\moraldescription_data_hemingway.csv
Processed moral description data saved to csv\processed_moraldescription_data_hemingway.csv
physical description data saved to csv\physicaldescription_data_hemingway.csv
Processed physical description data saved to csv\processed_physicaldescription_data_hemingway.csv
religion data saved to csv\religion_data_hemingway.csv
Processed religion data saved to csv\processed_religion_data_hemingway.csv
sex data saved to csv\sex_data_hemingway.csv
Processed sex data saved to csv\processed_sex_data_hemingway.csv
ethnicity data saved to csv\ethnicity_data.csv
Processed ethnicity data saved to csv\processed_ethnicity_data.csv
moral description data saved to csv\moraldescription_data.csv
Processed moral description data saved to csv\processed_moraldescription_data.csv
physical description data sa

In [61]:
author = process_ethnicity_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\Processed_ethnicity_data_hemingway.csv')
no_author = process_ethnicity_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\Processed_ethnicity_data.csv')

In [62]:
no_author['author'] = 'No Author'
author['author'] = 'Author'

### Ethnicity Distribution

In [78]:
df_list = [author, no_author]
df_concat = pd.concat(df_list)
df_grouped = df_concat.groupby(["ethnicity", "author"])["count"].sum().reset_index()
df_grouped

Unnamed: 0,ethnicity,author,count
0,arab,No Author,30
1,basque,Author,11
2,caucasian,No Author,77
3,celtic,No Author,21
4,cuban,Author,11
5,french-canadian,No Author,5
6,greek,No Author,15
7,gujarati,No Author,20
8,gypsy,No Author,11
9,hispanic,Author,13


In [79]:
from functions import chi_square_test_authors
result = chi_square_test_authors(df_grouped, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(743.5843),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 34}

In [80]:
from functions import compute_standardized_residuals_authors
result = compute_standardized_residuals_authors(df_grouped, index_col='ethnicity')
result

author,Author,No Author
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
arab,-3.366009,2.622156
basque,3.358648,-2.616421
caucasian,-5.392622,4.200908
celtic,-2.816205,2.193853
cuban,3.358648,-2.616421
gujarati,-2.748335,2.140981
hispanic,3.651235,-2.84435
indian,-4.232824,3.297414
kurdish,-2.744485,2.137982
latinoamerican,3.507993,-2.732763


In [83]:
from functions import compute_divergences_authors
authors = df_grouped["author"].unique()
df_grouped["relative_frequency"] = df_grouped.groupby("author")["count"].transform(lambda x: x / x.sum())
result = compute_divergences_authors(df_grouped, authors, index_col='ethnicity')
result

{'JSD': {'JSD(No Author || Author)': np.float64(0.7030834032147633)},
 'KL': {'KL(No Author || Author)': np.float64(7.8045888919229),
  'KL(Author || No Author)': np.float64(5.339351395791128)}}

### Positive Connotation - Ethnicity

In [63]:
no_author_pos = no_author[no_author['connotation'] == 'positive']
author_pos = author[author['connotation'] == 'positive']

In [67]:
df_list = [author_pos, no_author_pos]
df_concat = pd.concat(df_list)
df_concat

Unnamed: 0,ethnicity,connotation,count,relative_frequency,author
32,japanese,positive,62,0.305419,Author
51,mexican,positive,58,0.285714,Author
47,mayan,positive,8,0.039409,Author
7,basque,positive,6,0.029557,Author
23,hispanic,positive,5,0.024631,Author
26,indian,positive,5,0.024631,Author
16,cuban,positive,4,0.019704,Author
29,irish,positive,4,0.019704,Author
37,korean,positive,4,0.019704,Author
42,latinoamerican,positive,4,0.019704,Author


In [69]:
result = chi_square_test_authors(df_concat, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(361.515),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 26}

In [70]:
result = compute_standardized_residuals_authors(df_concat, index_col='ethnicity')
result

author,Author,No Author
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
basque,3.234411,-2.064682
indian,-4.355819,2.780531
mayan,3.734776,-2.384089
mexican,10.056192,-6.419356


In [71]:
authors = df_concat["author"].unique()
result = compute_divergences_authors(df_concat, authors, index_col='ethnicity')
result

{'JSD': {'JSD(Author || No Author)': np.float64(0.7162417055910915)},
 'KL': {'KL(Author || No Author)': np.float64(12.23822955692644),
  'KL(No Author || Author)': np.float64(6.099929452122382)}}

### Negative Connotation - Ethnicity

In [72]:
no_author_neg = no_author[no_author['connotation'] == 'negative']
author_neg = author[author['connotation'] == 'negative']

In [73]:
df_list = [author_neg, no_author_neg]
df_concat = pd.concat(df_list)
df_concat

Unnamed: 0,ethnicity,connotation,count,relative_frequency,author
50,mexican,negative,144,0.360902,Author
31,japanese,negative,68,0.170426,Author
63,russian,negative,50,0.125313,Author
52,mexican-american,negative,11,0.027569,Author
71,spanish,negative,10,0.025063,Author
28,irish,negative,9,0.022556,Author
22,hispanic,negative,8,0.02005,Author
25,indian,negative,8,0.02005,Author
41,latinoamerican,negative,8,0.02005,Author
15,cuban,negative,7,0.017544,Author


In [74]:
result = chi_square_test_authors(df_concat, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(436.3186),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 23}

In [75]:
result = compute_standardized_residuals_authors(df_concat, index_col='ethnicity')
result

author,Author,No Author
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
arab,-3.678756,3.335017
caucasian,-5.893669,5.34297
celtic,-3.077868,2.790275
gypsy,-2.227597,2.019453
hispanic,2.311486,-2.095503
indian,2.311486,-2.095503
kurdish,-2.715178,2.461474
latinoamerican,2.311486,-2.095503
mexican,9.194195,-8.335098
mexican-american,2.710458,-2.457195


In [76]:
authors = df_concat["author"].unique()
result = compute_divergences_authors(df_concat, authors, index_col='ethnicity')
result

{'JSD': {'JSD(Author || No Author)': np.float64(0.7362750744707991)},
 'KL': {'KL(Author || No Author)': np.float64(5.17976215858654),
  'KL(No Author || Author)': np.float64(10.364799326053703)}}

## Author vs Author

In [3]:
from functions import process_character_data
process_character_data('characters_output.json', keys_to_extract=None, writer="Edgar Allan Poe")
process_character_data('characters_output.json', keys_to_extract=None, writer="Jane Austen")

ethnicity data saved to csv\ethnicity_data_poe.csv
Processed ethnicity data saved to csv\processed_ethnicity_data_poe.csv
moral description data saved to csv\moraldescription_data_poe.csv
Processed moral description data saved to csv\processed_moraldescription_data_poe.csv
physical description data saved to csv\physicaldescription_data_poe.csv
Processed physical description data saved to csv\processed_physicaldescription_data_poe.csv
religion data saved to csv\religion_data_poe.csv
Processed religion data saved to csv\processed_religion_data_poe.csv
sex data saved to csv\sex_data_poe.csv
Processed sex data saved to csv\processed_sex_data_poe.csv
ethnicity data saved to csv\ethnicity_data_austen.csv
Processed ethnicity data saved to csv\processed_ethnicity_data_austen.csv
moral description data saved to csv\moraldescription_data_austen.csv
Processed moral description data saved to csv\processed_moraldescription_data_austen.csv
physical description data saved to csv\physicaldescription_d

In [5]:
from functions import process_ethnicity_data
poe = process_ethnicity_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\Processed_ethnicity_data_poe.csv')
austen = process_ethnicity_data(r'C:\Users\edoar\Documents\Tesi\thesis\csv\Processed_ethnicity_data_austen.csv')

In [6]:
poe['author'] = 'Poe'
austen['author'] = 'Austen'

### Ethnicity Distribution

In [12]:
df_list = [poe, austen]
df_concat = pd.concat(df_list)
df_grouped = df_concat.groupby(["ethnicity", "author"])["count"].sum().reset_index()
df_grouped

Unnamed: 0,ethnicity,author,count
0,afghan,Austen,8
1,african,Austen,77
2,african,Poe,3
3,africanamerican,Austen,8
4,africanamerican,Poe,32
5,arab,Poe,6
6,black,Austen,6
7,britishindian,Austen,4
8,cajun,Poe,21
9,caucasian,Austen,31


In [13]:
from functions import chi_square_test_authors
result = chi_square_test_authors(df_grouped, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(518.4741),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 32}

In [14]:
from functions import compute_standardized_residuals_authors
result = compute_standardized_residuals_authors(df_grouped, index_col='ethnicity')
result

author,Austen,Poe
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
african,4.349794,-5.209386
africanamerican,-3.206807,3.840526
cajun,-3.517562,4.21269
caucasian,-2.133219,2.554778
celtic,-2.427348,2.907033
english,4.509471,-5.400617
englishindian,3.426797,-4.103989
french-canadian,-3.432789,4.111165
greek,-2.030865,2.432198
indian,4.985434,-5.970639


In [15]:
from functions import compute_divergences_authors
authors = df_grouped["author"].unique()
df_grouped["relative_frequency"] = df_grouped.groupby("author")["count"].transform(lambda x: x / x.sum())
result = compute_divergences_authors(df_grouped, authors, index_col='ethnicity')
result

{'JSD': {'JSD(Austen || Poe)': np.float64(0.7521262230331562)},
 'KL': {'KL(Austen || Poe)': np.float64(7.64613464393532),
  'KL(Poe || Austen)': np.float64(7.483859300977419)}}

### Positive - Ethnicity

In [16]:
poe_pos = poe[poe['connotation'] == 'positive']
austen_pos = austen[austen['connotation'] == 'positive']

In [17]:
df_list = [poe_pos, austen_pos]
df_concat = pd.concat(df_list)
df_concat

Unnamed: 0,ethnicity,connotation,count,relative_frequency,author
33,indian,positive,21,0.104478,Poe
25,french-canadian,positive,16,0.079602,Poe
35,irish,positive,16,0.079602,Poe
13,caucasian,positive,15,0.074627,Poe
54,persian,positive,15,0.074627,Poe
11,cajun,positive,12,0.059701,Poe
3,africanamerican,positive,11,0.054726,Poe
47,kurdish,positive,9,0.044776,Poe
39,japanese,positive,8,0.039801,Poe
15,celtic,positive,7,0.034826,Poe


In [18]:
result = chi_square_test_authors(df_concat, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(209.7813),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 24}

In [19]:
result = compute_standardized_residuals_authors(df_concat, index_col='ethnicity')
result

author,Austen,Poe
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
african,2.811862,-2.762671
africanamerican,-2.324423,2.283759
cajun,-2.42778,2.385309
cree,2.178059,-2.139956
english,2.811862,-2.762671
englishindian,2.716517,-2.668994
french-canadian,-2.803359,2.754317
indian,4.263792,-4.189201
persian,-2.714341,2.666856


In [20]:
authors = df_concat["author"].unique()
result = compute_divergences_authors(df_concat, authors, index_col='ethnicity')
result

{'JSD': {'JSD(Poe || Austen)': np.float64(0.7673080624812966)},
 'KL': {'KL(Poe || Austen)': np.float64(12.600899879532538),
  'KL(Austen || Poe)': np.float64(7.803355110016076)}}

### Negative - Ethnicity

In [21]:
poe_neg = poe[poe['connotation'] == 'negative']
austen_neg = austen[austen['connotation'] == 'negative']

In [22]:
df_list = [poe_neg, austen_neg]
df_concat = pd.concat(df_list)
df_concat

Unnamed: 0,ethnicity,connotation,count,relative_frequency,author
34,irish,negative,88,0.437811,Poe
12,caucasian,negative,31,0.154229,Poe
2,africanamerican,negative,21,0.104478,Poe
10,cajun,negative,9,0.044776,Poe
46,kurdish,negative,9,0.044776,Poe
24,french-canadian,negative,4,0.0199,Poe
1,african,negative,3,0.014925,Poe
6,arab,negative,3,0.014925,Poe
14,celtic,negative,3,0.014925,Poe
28,gypsy,negative,3,0.014925,Poe


In [23]:
result = chi_square_test_authors(df_concat, index_col='ethnicity')
result

{'Chi-Square Statistic': np.float64(311.9793),
 'p-value': np.float64(0.0),
 'Degrees of Freedom': 20}

In [24]:
result = compute_standardized_residuals_authors(df_concat, index_col='ethnicity')
result

author,Austen,Poe
ethnicity,Unnamed: 1_level_1,Unnamed: 2_level_1
african,2.985641,-4.10746
africanamerican,-2.519432,3.466079
cajun,-2.426659,3.338447
english,3.198231,-4.399929
englishindian,2.22074,-3.055157
indian,3.676476,-5.057868
irish,-5.958347,8.197125


In [25]:
authors = df_concat["author"].unique()
result = compute_divergences_authors(df_concat, authors, index_col='ethnicity')
result

{'JSD': {'JSD(Poe || Austen)': np.float64(0.7804045552974412)},
 'KL': {'KL(Poe || Austen)': np.float64(4.068909359970863),
  'KL(Austen || Poe)': np.float64(12.68057068112229)}}