# Regression Analysis on PH Data



We want to use Ordinary Least Squares (OLS) regression to examine how likes, dislikes and views relate to the sentiment of the tags used.

I believe the main goal here is to find (and plot) these relationships for videos with positive tags and for videos with negative tags, then to compare them. Ultimately, the question we are trying to answer is which relationship is stronger: 'Do trans videos with negative-sentiment tags receive more views, likes and dislikes than trans videos with positive tags?'

To answer this, we can use the sample I generated for the last checkpoint, which contains 100,000+ rows of only trans-related videos, with an additional column added to indicate the sentiment. This file can be found in the latest release on our group GitHub.

For context, the rules for deciding whether a video has positive or negative connotations are:
- If even one derogatory tag is included, the video will be labelled as negative.
- If all tags for a video are positive or unrelated, then it will be labelled as positive.
- This is based on lists of words I created, not on sentiment analysis; this is because sentiment analysis libraries do not include many derogatory terms used against trans people. With that said, if you would like to change the rules/refine the system then feel free! The lists will also be included below.

Additionally, the sentiment column will be encoded numerically below to avoid issues with comparing quantitative variables to qualitative variables.

It is also worth testing other methods of analysis if OLS does not seem to fit well with the question.
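If OLS fits poorly (count data such as views tends to be heavily right-skewed), one possible alternative is a rank-based comparison of the two sentiment groups. The following is a minimal sketch, assuming synthetic log-normal data in place of the real `views` column. The names `neg_views` and `pos_views` are hypothetical, and the Mann-Whitney U test is only one option, not something decided in this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for view counts in each sentiment group (not real data)
neg_views = rng.lognormal(mean=9.0, sigma=1.5, size=500)
pos_views = rng.lognormal(mean=8.5, sigma=1.5, size=500)

# Two-sided Mann-Whitney U test: compares the groups by rank, so it does not
# assume normally distributed values
u_stat, p_value = stats.mannwhitneyu(neg_views, pos_views, alternative='two-sided')
print(f"U = {u_stat:.0f}, p = {p_value:.4g}")
```

With the real data, the two arrays would be the `views` (or `likes`, `dislikes`) values for rows with sentiment 0 and sentiment 1 respectively.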

### Importing tools

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats

### Loading CSV and encoding the sentiment column

In [2]:
# Change the path to your own
df = pd.read_csv(r'~\Documents\phdb\PhDataFiltered.csv')  # raw string so backslashes are not treated as escapes
df.head()

Unnamed: 0.1,Unnamed: 0,cleaned_title,tag_list,Categories,views,likes,dislikes,Title,Tags
0,0,"['transsexual', 'heartbreakers', 'scene']","['pornhub.com', 'blonde', 'milf', 'beauty', 't...",transgender,71525,284.0,33.0,transsexual heartbreakers 19 - scene 5,pornhub.com;blonde;milf;beauty;tranny;anal;ass...
1,1,"['big', 'tit', 'trannys', 'scene']","['pornhub.com', 'shemale', 'lady-boy', 'tranny...",transgender,11896,26.0,11.0,big tit trannys - scene 3,pornhub.com;shemale;lady-boy;tranny;transsexua...
2,2,"['unsafe', 'sex', 'transsexual', 'barebackin',...","['pornhub.com', 'tranny', 'strip', 'small-tits...",transgender,18339,58.0,9.0,unsafe sex with transsexual barebackin 1 - sce...,pornhub.com;tranny;strip;small-tits;brunette;l...
3,3,"['transsexual', 'heartbreakers', 'scene']","['pornhub.com', 'she-male', 'blonde', 'latina'...",transgender,47759,162.0,42.0,transsexual heartbreakers 8 - scene 4,pornhub.com;she-male;blonde;latina;transsexual...
4,4,"['sluts', 'packing', 'nuts', 'scene']","['pornhub.com', 'shemale', 'stockings', 'trann...",transgender,9803,19.0,7.0,sluts packing nuts 5 - scene 1,pornhub.com;shemale;stockings;tranny;tanlines;...


In [3]:
# Dropping unnecessary columns
df = df.drop(columns=['Unnamed: 0', 'Title', 'Tags', 'Categories'])
df.columns

Index(['cleaned_title', 'tag_list', 'views', 'likes', 'dislikes'], dtype='object')

In [4]:
# Initializing word lists. Feel free to expand if you know more examples
neg_tags = ['shemale', 'tranny', 'sissy', 'ladyboy', 'trap', 'transsexual', 'transexual', 'shemales', 'trannies', 'sissies', 'ladyboys', 'traps', 'troon', 'troons'] 
pos_tags = ['transgender', 'trans', 'transgirl', 'transboy', 'transman', 'ftm', 'mtf', 'transgirls', 'transboys']

In [5]:
# Creating a sentiment column according to the rules:
# 0 = negative sentiment, 1 = positive sentiment

def determine_sentiment(title, tags):
    # Both columns are read from the CSV as strings, so this is string
    # concatenation and the `in` checks below are substring matches
    words = title + tags

    # Check for any negative word
    if any(neg_word in words for neg_word in neg_tags):
        return 0

    # If no negatives, check for positive words
    if any(pos_word in words for pos_word in pos_tags):
        return 1

    # If neither list matches (should be impossible), assign -1
    return -1

# Applying func
df['sentiment'] = df.apply(lambda row: determine_sentiment(row['cleaned_title'], row['tag_list']), axis=1)
df['sentiment'].value_counts()

sentiment
0    86405
1    29663
Name: count, dtype: int64
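Because `cleaned_title` and `tag_list` are read from the CSV as strings, the `in` checks above test for substrings, so a short term such as `trans` also matches inside `transsexual`. If exact tag matches are preferred, something like the sketch below could be used instead. It assumes the columns hold string representations of Python lists (as the `df.head()` output suggests). `determine_sentiment_exact` is a hypothetical name, and the shortened word lists are for illustration only.

```python
import ast

# Shortened, illustrative versions of the notebook's word lists
neg_tags = {'shemale', 'tranny', 'transsexual'}
pos_tags = {'transgender', 'trans', 'transgirl'}

def determine_sentiment_exact(title, tags):
    """Return 0 (negative), 1 (positive) or -1 (no match) using whole-token matches."""
    # Parse the stringified lists back into real Python lists
    words = set(ast.literal_eval(title)) | set(ast.literal_eval(tags))
    if words & neg_tags:
        return 0
    if words & pos_tags:
        return 1
    return -1

# Set membership only matches whole tokens, unlike substring checks on a string
print(determine_sentiment_exact("['slave', 'tranny']", "['trans', 'tgirl']"))
print(determine_sentiment_exact("['model']", "['trans', 'blonde']"))
print(determine_sentiment_exact("['model']", "['blonde']"))
```

It could be applied with the same `df.apply(..., axis=1)` call as above. Note that switching to exact matching would change the sentiment counts, so any results should be regenerated rather than compared with the counts shown here.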

### At this point, only the likes, dislikes, views and sentiment columns should be necessary

### One question is whether sentiment should be proportional: a number between 0 and 1 that shows the balance between negative and positive tags
A binary variable may limit what regression can show here, since it only separates the videos into two groups.
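For reference, OLS with a single 0/1 predictor is still well defined: the fitted slope equals the difference between the two group means, so a binary column gives a group comparison rather than a trend. A small NumPy sketch on made-up numbers (not taken from the dataset) illustrates this.

```python
import numpy as np

# Made-up values: x is a 0/1 sentiment flag, y is some outcome such as likes
x = np.array([0, 0, 0, 1, 1, 1], dtype=float)
y = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 27.0])

# Least-squares fit of y = intercept + slope * x
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

mean_neg = y[x == 0].mean()   # mean outcome where the flag is 0
mean_pos = y[x == 1].mean()   # mean outcome where the flag is 1

# The intercept matches the group-0 mean and the slope matches the mean difference
print(intercept, slope, mean_pos - mean_neg)
```

A proportion between 0 and 1 gives the model more than two distinct predictor values to work with, which is the motivation for the column built in the next cell.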

In [12]:
# Attempting the above. Closer to 1 is more positive, closer to 0 is more negative

def determine_sentiment_proportion(title, tags):
    words = title + tags

    # Counting how many words from each sentiment list appear at least once
    positive_count = sum(pos_word in words for pos_word in pos_tags)
    negative_count = sum(neg_word in words for neg_word in neg_tags)
    total_count = positive_count + negative_count

    # Guard against division by zero when no listed word appears at all
    if total_count == 0:
        return float('nan')

    # Return the proportion of positive words over the total count
    # If there are no positive words, the result is zero (indicating fully negative)
    return positive_count / total_count

# Applying func
df['sentiment_proportion'] = df.apply(lambda row: determine_sentiment_proportion(row['cleaned_title'], row['tag_list']), axis=1)

df

Unnamed: 0,cleaned_title,tag_list,views,likes,dislikes,sentiment,sentiment_proportion
0,"['transsexual', 'heartbreakers', 'scene']","['pornhub.com', 'blonde', 'milf', 'beauty', 't...",71525,284.0,33.0,0,0.333333
1,"['big', 'tit', 'trannys', 'scene']","['pornhub.com', 'shemale', 'lady-boy', 'tranny...",11896,26.0,11.0,0,0.400000
2,"['unsafe', 'sex', 'transsexual', 'barebackin',...","['pornhub.com', 'tranny', 'strip', 'small-tits...",18339,58.0,9.0,0,0.333333
3,"['transsexual', 'heartbreakers', 'scene']","['pornhub.com', 'she-male', 'blonde', 'latina'...",47759,162.0,42.0,0,0.500000
4,"['sluts', 'packing', 'nuts', 'scene']","['pornhub.com', 'shemale', 'stockings', 'trann...",9803,19.0,7.0,0,0.000000
...,...,...,...,...,...,...,...
116063,"['crossdressing', 'model', 'anastasia', 'harri...","['anastasia-harris', 'crossdresser', 'sissy-cr...",51,0.0,0.0,0,0.666667
116064,"['destroying', 'bussy', 'huge', 'mr', 'hankey'...","['huge-dildo', 'huge-dildo-anal', 'huge-dildo-...",11,0.0,0.0,0,0.000000
116065,"['blonde', 'trap', 'stunner', 'karina', 'abelh...","['transsexual', 'shemale', 'tranny', 'blonde',...",135,1.0,0.0,0,0.200000
116066,"['slave', 'tranny']","['trans', 'tgirl', 'teaser', 'onlyfans-teaser'...",9,0.0,0.0,0,0.500000


### The sentiment_proportion column should be better suited to finding relationships, so we can now continue with the analysis

In [13]:
# Continuing this later tonight