# 100_process_small_data

> This notebook loads and processes the toy data we will be using to smoketest fine-tuning the LLMs.

In [1200]:
import pandas as pd
import numpy as np

**First, let's load our data into Pandas dataframes and view them to make sure everything works.**

In [1201]:
# Read videos_small.csv into a DataFrame
videos_df = pd.read_csv('../data/raw/videos_small.csv')

# Read comments_small.csv into a DataFrame
comments_df = pd.read_csv('../data/raw/comments_small.csv')

# Read channels_small.csv into a DataFrame
channels_df = pd.read_csv('../data/raw/channels_small.csv')




In [1202]:
channels_df.head(1)

Unnamed: 0,viewCount,subscriberCount,hiddenSubscriberCount,videoCount,channel_etag,channel_id,channel_title,channel_desc,created_date,channel_thumb,commentCount,scraped_date
0,488833.0,13800.0,0,17,lWDtFE6IVC1uuJUk6X6nxJM9G5A,UCmY71FGkk5kMwde_TP3KbnQ,Timothy Snyder,,2015-06-26T17:31:11Z,https://yt3.ggpht.com/ytc/AAUvwng0F0CDw54Jjqgc...,,1611753925


In [1203]:
videos_df.head(1)

Unnamed: 0,video_id,video_title,publication_date,video_description,channel_id,channel_title,viewCount,likeCount,dislikeCount,favoriteCount,commentCount,video_url,scraped_date
0,-rMfWXkz1OI,Rumored RE4 Remake has been DELAYED!!!,2021-01-23T20:00:01Z,This segment previously aired on a 01/22/2021 ...,UCv9X4TbvB6Eo0s7H20abg5g,Tipster,1485,79.0,5.0,0,45.0,https://www.youtube.com/watch?v=-rMfWXkz1OI,1611753925


In [1204]:
comments_df.head(1)

Unnamed: 0,video_id,textDisplay,authorDisplayName,authorProfileImageUrl,authorChannelUrl,authorChannelId,canRate,viewerRating,likeCount,publishedAt,updatedAt,comment_id,parentId,moderationStatus,scraped_date,channel_id
0,-rMfWXkz1OI,"RE2 was good, RE 3 is trash ( like the origina...",Optimized C,https://yt3.ggpht.com/ytc/AAUvwnjoO4ILJ5f_f1qa...,http://www.youtube.com/channel/UCApmTT4FP_paOo...,UCApmTT4FP_paOolvEUq3Drw,1,none,0,2021-01-25T17:23:45Z,2021-01-25T17:25:57Z,UgzxGyGWCRYlN6MsV5l4AaABAg,,,1611753925,UCv9X4TbvB6Eo0s7H20abg5g


Let's see what the shapes of our tables are:

In [1205]:
df_dic = {
    'channels':channels_df, 
    'videos':videos_df,
    'comments':comments_df
}

# Print the dimensions of each DataFrame
print("Shapes: ")
for name, frame in df_dic.items():
    print(f"{name}: {frame.shape}")

Shapes: 
channels: (1000, 12)
videos: (10000, 13)
comments: (100000, 16)


### For the smoketest, we are going to manually label progressive and conservative channels and select a subset of channels and their videos/comments

In [1206]:
np.unique(channels_df['channel_title'])

array(['(Active Research & Informed Opinion)',
       '13 Questions by Man Transcending', '6oodfella', '71 Republic',
       "A Legume of One's Own", 'A Voice For Men', 'A2Z TV',
       'ABitOfBritt', 'ADAM FRIENDED', 'AMTV', 'AQUAMARINEFACE',
       'ARealSJW', 'ASKDrBrown', 'AT2 Productions', 'AaronClarey',
       'Aarvoll', 'Abby Parmelee aka Abigail Beverly Hillz',
       'AbdultheImpailler', 'Abel Bodi', 'Acronym TV',
       'Acts17Apologetics', 'Actual Justice Warrior', 'Adam Carolla',
       'AdamKokesh', 'Adrian Salbuchi', 'Adventist Hermes Justin Wilson',
       'African Diaspora News Insider', 'Against the Odds', 'Age of Age',
       'AgentOfDoubt', 'AircraftSparky', 'Airliner World & More',
       'Akilah Obviously', 'Aldus Valor', 'AlfonZo Rachel',
       'Alien Queen Of Darkness', 'Alizee', 'Alli YAFF',
       'Allie Beth Stuckey', 'AlphaOmegaSin', 'AlternateFocus',
       'Amanee Powers', 'America Uncovered', 'American Anarchist',
       'American Enterprise Institute', '

For now, I am using the Ad Fontes Media interactive media bias chart (https://adfontesmedia.com/interactive-media-bias-chart/) and my own knowledge to pick out recognizable channels from the list above.

Right:
'American Thought Leaders - The Epoch Times',
'Ben Shapiro',
'Blaire White',
'Candace Owens',
'The Daily Wire',
'The Rubin Report',
'Turning Point USA',
'Fox News Insider',
'Jesse Lee Peterson',
'PragerU',
'NRA',
'TheQuartering',
'The Daily Signal',
'WeAreChange',
'The Jimmy Dore Show',
'Iraqveteran8888'

Left:
'AsapSCIENCE',
'David Pakman Show',
"TYT's The Conversation",
'The Majority Report w/ Sam Seder',
'h3h3Productions',
'H3 Podcast',
'HasanAbi',
'Vox',
'VICE',
'The Michael Brooks Show',
'The Andrew Schulz',
'Sam Harris',
'Vaush'

In [1207]:
channel_subset = {
    'right':[
        'American Thought Leaders - The Epoch Times',
        'Ben Shapiro',
        'Blaire White',
        'Candace Owens',
        'The Daily Wire',
        'The Rubin Report',
        'Turning Point USA',
        'Fox News Insider',
        'Jesse Lee Peterson',
        'PragerU',
        'NRA',
        'TheQuartering',
        'The Daily Signal',
        'WeAreChange',
        'The Jimmy Dore Show',
        'Iraqveteran8888'
    ],
    'left':[
        'AsapSCIENCE',
        'David Pakman Show',
        "TYT's The Conversation",
        'The Majority Report w/ Sam Seder',
        'h3h3Productions',
        'H3 Podcast',
        'HasanAbi',
        'Vox',
        'VICE',
        'The Michael Brooks Show',
        'The Andrew Schulz',
        'Sam Harris',
        'Vaush'
    ]
}

After further assessment, I noticed that only a small percentage of the scraped channels have comments. :( This means we will have to choose from those channels.

These include:
'Tipster', 'TJump', 'TMM', 'Toad McKinley', 'Men Are Good!',
'RTR TRUTH MEDIA', 'Tom Secker', "Tommy C's SFTP", 'TomWoodsTV',
'Tony Kriz', 'ToolTime', 'The Aureus Press', 'Trae Crowder',
'Transition Radio Show', 'Tree Of Logic', 'Triggered Newscast',
'TrutherTalk', 'TruthRadioShow', 'Truth Teacher2007',
'Turd Flinging Monkey', 'Turning Point USA', 'Pleb Media',
'TylerPreston20', "TYT's The Conversation", 'TZMOfficialChannel',
'UK Column'

We will have to do some exploration to label them.
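One way to find which channels actually have scraped comments is to check each channel ID against the `channel_id` column of the comments table. The sketch below uses tiny made-up frames (the IDs and titles are placeholders, not rows from the real data), so treat it as an illustration of the idea rather than the exact step used here.

```python
import pandas as pd

# Toy stand-ins for channels_df and comments_df; all IDs and titles are made up.
toy_channels = pd.DataFrame({
    'channel_id': ['c1', 'c2', 'c3'],
    'channel_title': ['Channel A', 'Channel B', 'Channel C'],
})
toy_comments = pd.DataFrame({
    'comment_id': ['k1', 'k2', 'k3'],
    'channel_id': ['c1', 'c1', 'c3'],
})

# Keep only channels whose ID appears at least once in the comments table.
has_comments = toy_channels['channel_id'].isin(toy_comments['channel_id'])
channels_with_comments = toy_channels.loc[has_comments, 'channel_title'].tolist()
print(channels_with_comments)
```

Running the same check on the real `channels_df` and `comments_df` should reproduce a list like the one above, assuming the `channel_id` values match across the two files.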

In [1208]:
channel_subset_updated = {
       'left': [
              'Tipster', 'TJump',
              'TMM', "Tommy C's SFTP",
              'Trae Crowder',
              "TYT's The Conversation", 
              'TZMOfficialChannel',
              'UK Column'
       ],
       'right': [
              'Men Are Good!', 'Tony Kriz',
              'TomWoodsTV', 'ToolTime',
              'Tree Of Logic',  'Triggered Newscast',
              'TrutherTalk',  'Turning Point USA',
              'TylerPreston20', 
              
              
       ]
}

Let's make a dataframe of comments. First, we build a table of the labeled channels.

In [1209]:
# Take an explicit copy so the new column is not set on a slice of channels_df
right_channels = channels_df.loc[channels_df['channel_title'].isin(channel_subset_updated['right'])].copy()
right_channels['affiliation'] = 'R'
right_channels = right_channels[['channel_id','channel_title', 'affiliation']]
right_channels


Unnamed: 0,channel_id,channel_title,affiliation
6,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R
11,UCsgWR55UyAiFarZYl1u1l9Q,TomWoodsTV,R
12,UCBYddbOHrbGipDkQtDrJQxA,Tony Kriz,R
13,UCsu6NM_ARGWzDwL1C9xEhqA,ToolTime,R
18,UCl3RCEtooHD5bhPCHJw3btA,Tree Of Logic,R
19,UCLmEFIwfG3HSru_nUgOktGg,Triggered Newscast,R
22,UCamIYU1V5mlVPWQtgq-WgUw,TrutherTalk,R
26,UCVrK_pMRp_q8IelpfUCTGLQ,Turning Point USA,R
28,UCTrjPBx2WY3wsd1OKmMQv8w,TylerPreston20,R


In [1210]:
# Take an explicit copy so the new column is not set on a slice of channels_df
left_channels = channels_df.loc[channels_df['channel_title'].isin(channel_subset_updated['left'])].copy()
left_channels['affiliation'] = 'L'
left_channels = left_channels[['channel_id','channel_title', 'affiliation']]
left_channels


Unnamed: 0,channel_id,channel_title,affiliation
1,UCv9X4TbvB6Eo0s7H20abg5g,Tipster,L
2,UCHXrvsK33VUEcpa4Ar0c0Sg,TJump,L
3,UCQb22imbIqKKWOC98C8Rm2A,TMM,L
10,UCzsqJV6eaQPlmaXgmYaw9kw,Tommy C's SFTP,L
16,UCTHsQd-vRXK1bp4vpifl6yA,Trae Crowder,L
30,UCKw8kdkYfmuNSVehGoDw8Mg,TYT's The Conversation,L
31,UCEwoFdqY09VwZFESGZ8Qp4A,TZMOfficialChannel,L
32,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L


In [1211]:
channel_subset_df = pd.concat([right_channels, left_channels])
channel_subset_df

Unnamed: 0,channel_id,channel_title,affiliation
6,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R
11,UCsgWR55UyAiFarZYl1u1l9Q,TomWoodsTV,R
12,UCBYddbOHrbGipDkQtDrJQxA,Tony Kriz,R
13,UCsu6NM_ARGWzDwL1C9xEhqA,ToolTime,R
18,UCl3RCEtooHD5bhPCHJw3btA,Tree Of Logic,R
19,UCLmEFIwfG3HSru_nUgOktGg,Triggered Newscast,R
22,UCamIYU1V5mlVPWQtgq-WgUw,TrutherTalk,R
26,UCVrK_pMRp_q8IelpfUCTGLQ,Turning Point USA,R
28,UCTrjPBx2WY3wsd1OKmMQv8w,TylerPreston20,R
1,UCv9X4TbvB6Eo0s7H20abg5g,Tipster,L
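The labeled channel table can also be built in one pass by inverting the label dictionary into a title-to-label lookup. This is a minimal sketch on made-up data: `toy_subset` and `toy_channels` are hypothetical stand-ins for `channel_subset_updated` and `channels_df`.

```python
import pandas as pd

# Hypothetical label dictionary in the same shape as channel_subset_updated.
toy_subset = {'left': ['Channel A'], 'right': ['Channel B', 'Channel C']}
toy_channels = pd.DataFrame({
    'channel_id': ['c1', 'c2', 'c3', 'c4'],
    'channel_title': ['Channel A', 'Channel B', 'Channel C', 'Channel D'],
})

# Invert {side: [titles]} into {title: 'L' or 'R'}.
title_to_label = {
    title: side[0].upper()
    for side, titles in toy_subset.items()
    for title in titles
}

# Map titles to labels; channels without a label become NaN and are dropped.
labeled = toy_channels.assign(
    affiliation=toy_channels['channel_title'].map(title_to_label)
).dropna(subset=['affiliation'])[['channel_id', 'channel_title', 'affiliation']]
print(labeled)
```

Because the result is built with `assign` rather than by writing into a filtered slice, this form does not raise pandas' copy-versus-view warning.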


In [1212]:
channel_subset_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17 entries, 6 to 32
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   channel_id     17 non-null     object
 1   channel_title  17 non-null     object
 2   affiliation    17 non-null     object
dtypes: object(3)
memory usage: 544.0+ bytes


In [1213]:
videos_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   video_id           10000 non-null  object 
 1   video_title        10000 non-null  object 
 2   publication_date   10000 non-null  object 
 3   video_description  9530 non-null   object 
 4   channel_id         10000 non-null  object 
 5   channel_title      10000 non-null  object 
 6   viewCount          10000 non-null  int64  
 7   likeCount          9769 non-null   float64
 8   dislikeCount       9769 non-null   float64
 9   favoriteCount      10000 non-null  int64  
 10  commentCount       9684 non-null   float64
 11  video_url          10000 non-null  object 
 12  scraped_date       10000 non-null  int64  
dtypes: float64(3), int64(3), object(7)
memory usage: 1015.8+ KB


In [1214]:
channel_subset_with_comments = channels_df.merge(videos_df, on='channel_id',how='inner')
channel_subset_with_comments = channel_subset_with_comments.merge(comments_df, on='video_id', how='left')[['channel_id_x', 'channel_title_x', 'video_id', 'video_title', 'video_description', 'textDisplay', 'comment_id']]

In [1215]:
channel_subset_with_comments = channel_subset_df.merge(videos_df, on='channel_id',how='inner')
channel_subset_with_comments = channel_subset_with_comments.merge(comments_df, on='video_id', how='inner')[['channel_id_x', 'channel_title_x', 'affiliation', 'video_id', 'video_title', 'video_description', 'textDisplay', 'comment_id']]
channel_subset_with_comments

Unnamed: 0,channel_id_x,channel_title_x,affiliation,video_id,video_title,video_description,textDisplay,comment_id
0,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Men&#39;s generally poorer health from stress ...,Ugxtc_P8-7EmyHXAaF14AaABAg
1,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Is anyone on gab? I can&#39;t find people ther...,UgyhGIdaUug-1Ciyfbh4AaABAg
2,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Life sucks doesn&#39;t it ladies. He just up a...,UgwGR7A9eYedizYeCGZ4AaABAg
3,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",I have seen figures that show less death in th...,UgyIq8h56bQaUI-O55l4AaABAg
4,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",&quot;Waiting for the call for women to protec...,UgzLSxmvmakux6GzJG14AaABAg
...,...,...,...,...,...,...,...,...
89814,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...","@Robert Green exactly, I live within walking d...",UgzziTnfC9kiB5qsyyF4AaABAg.9IAyyudOaoc9IBX0mIauLj
89815,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",wanna discuss clap trap Brian?,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB9FsByn5z
89816,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...","Which, when answered correctly leads nicely in...",UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB91oH6ug6
89817,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",WE will end there. Where? Who favoured Helleni...,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB-WsSjP9A


In [1216]:
channel_subset_with_comments['channel_title_x'].value_counts()


channel_title_x
Tipster                   26178
Turning Point USA         24891
Trae Crowder               8983
TrutherTalk                6390
TomWoodsTV                 4946
TJump                      4241
TYT's The Conversation     3885
TMM                        3584
Tommy C's SFTP             2631
Triggered Newscast         1638
UK Column                  1243
Men Are Good!               520
TylerPreston20              367
ToolTime                    160
TZMOfficialChannel           95
Tree Of Logic                64
Tony Kriz                     3
Name: count, dtype: int64

In [1217]:
channel_subset_with_comments = channel_subset_with_comments.rename(
    columns={'channel_id_x':'channel_id',
             'channel_title_x':'channel_title',
            'textDisplay':'comment_text'}
)
channel_subset_with_comments

Unnamed: 0,channel_id,channel_title,affiliation,video_id,video_title,video_description,comment_text,comment_id
0,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Men&#39;s generally poorer health from stress ...,Ugxtc_P8-7EmyHXAaF14AaABAg
1,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Is anyone on gab? I can&#39;t find people ther...,UgyhGIdaUug-1Ciyfbh4AaABAg
2,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Life sucks doesn&#39;t it ladies. He just up a...,UgwGR7A9eYedizYeCGZ4AaABAg
3,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",I have seen figures that show less death in th...,UgyIq8h56bQaUI-O55l4AaABAg
4,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",&quot;Waiting for the call for women to protec...,UgzLSxmvmakux6GzJG14AaABAg
...,...,...,...,...,...,...,...,...
89814,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...","@Robert Green exactly, I live within walking d...",UgzziTnfC9kiB5qsyyF4AaABAg.9IAyyudOaoc9IBX0mIauLj
89815,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",wanna discuss clap trap Brian?,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB9FsByn5z
89816,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...","Which, when answered correctly leads nicely in...",UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB91oH6ug6
89817,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",WE will end there. Where? Who favoured Helleni...,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB-WsSjP9A


In [1218]:
channel_subset_with_comments.fillna('', inplace=True)

We now have a dataframe of comments, each labeled by whether it comes from a progressive (L) or conservative (R) channel. We also have the video titles and descriptions, which give context for what the comments are talking about. That context matters, since it may dictate our fine-tuning approach.
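The `_x` suffixes seen in the merged column names appear because columns such as `channel_title` and `channel_id` exist in more than one of the tables being joined. The sketch below, on made-up toy frames, shows one possible way to avoid the suffixes by dropping the duplicated columns before merging. The `validate` argument is an optional extra check, not something the cells above use.

```python
import pandas as pd

# Toy frames; every value is a placeholder.
toy_labels = pd.DataFrame({
    'channel_id': ['c1'], 'channel_title': ['Channel A'], 'affiliation': ['L'],
})
toy_videos = pd.DataFrame({
    'video_id': ['v1', 'v2'], 'channel_id': ['c1', 'c1'],
    'channel_title': ['Channel A', 'Channel A'], 'video_title': ['First', 'Second'],
})
toy_comments = pd.DataFrame({
    'comment_id': ['k1', 'k2'], 'video_id': ['v1', 'v1'],
    'textDisplay': ['hello', 'world'], 'channel_id': ['c1', 'c1'],
})

# Drop columns the label table already carries, so no _x/_y suffixes are needed.
videos_slim = toy_videos.drop(columns=['channel_title'])
comments_slim = toy_comments.drop(columns=['channel_id'])

# Inner joins keep only videos that have at least one comment.
merged = (
    toy_labels
    .merge(videos_slim, on='channel_id', how='inner', validate='one_to_many')
    .merge(comments_slim, on='video_id', how='inner', validate='one_to_many')
)
print(merged.columns.tolist())
```

With this shape, the later rename step would only need to handle `textDisplay`.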

After a little exploration, it seems like it will be useful to process comments and video descriptions first.

### **Process comments and video descriptions.**

*Comments first*

In [1219]:
pd.unique(channel_subset_with_comments['channel_title'])

array(['Men Are Good!', 'TomWoodsTV', 'Tony Kriz', 'ToolTime',
       'Tree Of Logic', 'Triggered Newscast', 'TrutherTalk',
       'Turning Point USA', 'TylerPreston20', 'Tipster', 'TJump', 'TMM',
       "Tommy C's SFTP", 'Trae Crowder', "TYT's The Conversation",
       'TZMOfficialChannel', 'UK Column'], dtype=object)

In [1220]:
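# Code points of the printable ASCII punctuation characters (space to '/', ':' to '@', '[' to '`', '{' to '~')
special_char_vals = [*range(32, 48), *range(58, 65), *range(91, 97), *range(123, 127)]
# Map each numeric HTML entity (e.g. '&#39;') to its character
ascii_dict = {f'&#{i};': chr(i) for i in special_char_vals}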

In [1221]:
other_syms = """
&quot;	"	&#34;	quotation mark	u+0022 ISOnum	\0022	\42
&num;	#	&#35;	number sign	u+0023 ISOnum	\0023	\43
&dollar;	$	&#36;	dollar sign	u+0024 ISOnum	\0024	\44
&percnt;	%	&#37;	percent sign	u+0025 ISOnum	\0025	\45
&amp;	&	&#38;	ampersand	u+0026 ISOnum	\0026	\46
&apos;	'	&#39;	apostrophe	u+0027 ISOnum	\0027	\47
&lpar;	(	&#40;	left parenthesis	u+0028 ISOnum	\0028	\50
&rpar;	)	&#41;	right parenthesis	u+0029 ISOnum	\0029	\51
&ast;	*	&#42;	asterisk	u+002A ISOnum	\002a	\52
&plus;	+	&#43;	plus sign	u+002B ISOnum	\002b	\53
&comma;	,	&#44;	comma	u+002C ISOnum	\002c	\54
&minus;	-	&#45;	hyphen-minus	u+002D ISOnum	\002d	\55
&period;	.	&#46;	full stop; period	u+002E ISOnum	\002e	\56
&sol;	/	&#47;	solidus; slash	u+002F ISOnum	\002f	\57
&colon;	:	&#58;	colon	u+003A ISOnum	\003a	\72
&semi;	;	&#59;	semicolon	u+003B ISOnum	\003b	\73
&lt;	<	&#60;	less-than	u+003C ISOnum	\003c	\74
&equals;	=	&#61;	equals	u+003D ISOnum	\003d	\75
&gt;	>	&#62;	greater-than sign	u+003E ISOnum	\003e	\76
&quest;	?	&#63;	question mark	u+003F ISOnum	\003f	\77
&commat;	@	&#64;	at sign; commercial at	u+0040 ISOnum	\0040	\100
&lsqb;	[	&#91;	left square bracket	u+005B ISOnum	\005b	\133
&bsol;	\	&#92;	backslash	u+005C ISOnum	\005c	\134
&rsqb;	]	&#93;	right square bracket	u+005D ISOnum	\005d	\135
&Hat;	^	&#94;	circumflex accent	u+005E ISOnum	\005e	\136
&lowbar;	_	&#95;	low line	u+005F ISOnum	\005f	\137
&grave;	`	&#96;	grave accent	u+0060 ISOnum	\0060	\u0060
&lcub;	{	&#123;	left curly bracket	u+007b ISOnum	\007b	\173
&verbar;	|	&#124;	vertical bar	u+007c ISOnum	\007c	\174
&rcub;
"""
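# Each table row has 7 tab-separated columns, but a row's last column and the next
# row's entity name end up in the same chunk, so the pattern repeats every 6 chunks.
used_arr = other_syms.split('\t')
key = ""
dic = {}
for i, chunk in enumerate(used_arr):
    if i % 6 == 0:
        # Entity name, e.g. '&quot;'; after the first row, drop the leading
        # character left over from the previous row's last column
        key = chunk.replace('\n', '')
        if i != 0:
            key = key[1:]
    elif i % 6 == 1:
        # Store the pair only once the character is known, so the final
        # incomplete row ('&rcub;') is not mapped to a stale value
        dic[key] = chunk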

In [1222]:
ascii_dict.update(dic)
    
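def replace_ascii_codes(row):
    # Non-string values (e.g. NaN) become empty strings
    if not isinstance(row, str):
        return ''
    # Replace every known entity with its literal character
    for code, char in ascii_dict.items():
        row = row.replace(code, char)
    return row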
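As an alternative to a hand-built lookup like `ascii_dict`, Python's standard library provides `html.unescape`, which decodes named and numeric HTML entities in one call. The sketch below runs it on a few sample strings written in the style of the raw comments. Note that it decodes every entity it recognizes, not only the ASCII subset listed above, so results could differ slightly from `replace_ascii_codes`.

```python
import html

# Sample strings in the style of the raw textDisplay values.
samples = [
    "Men&#39;s health",
    "&quot;Waiting for the call&quot;",
    "family &amp; friends",
]

# html.unescape handles both numeric (&#39;) and named (&quot;, &amp;) entities.
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

If adopted, it could be applied to a column with `.map(html.unescape)`, after guarding against non-string values as `replace_ascii_codes` does.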

In [1223]:
channel_subset_with_comments.comment_text = channel_subset_with_comments.comment_text.map(replace_ascii_codes)
channel_subset_with_comments = channel_subset_with_comments[~channel_subset_with_comments['comment_text'].str.contains('@', na=False)]
#channel_subset_with_comments = channel_subset_with_comments[channel_subset_with_comments['comment_text'] != '']
channel_subset_with_comments.head(3)

Unnamed: 0,channel_id,channel_title,affiliation,video_id,video_title,video_description,comment_text,comment_id
0,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Men's generally poorer health from stress caus...,Ugxtc_P8-7EmyHXAaF14AaABAg
1,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...","Is anyone on gab? I can't find people there, p...",UgyhGIdaUug-1Ciyfbh4AaABAg
2,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Life sucks doesn't it ladies. He just up and d...,UgwGR7A9eYedizYeCGZ4AaABAg
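The `'@'` filter above removes every comment whose text contains an `@` anywhere, which covers replies that mention another user but also, for example, comments that include an e-mail address. A small sketch on made-up comments illustrates this behavior, including how `na=False` treats missing text:

```python
import pandas as pd

# Made-up comments: a reply, a plain comment, an e-mail mention, and a missing value.
toy = pd.DataFrame({
    'comment_text': ['@Someone I agree', 'Great video', 'mail me at a@b.com', None],
})

# Drop any row whose text contains '@'.
# na=False treats missing text as "no match", so those rows are kept.
contains_at = toy['comment_text'].str.contains('@', na=False)
kept = toy[~contains_at].copy()
print(kept)
```

Taking `.copy()` after the filter also means later column assignments on the result do not trigger pandas' copy-versus-view warning.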


In [1224]:
pd.unique(channel_subset_with_comments['channel_title'])

array(['Men Are Good!', 'TomWoodsTV', 'Tony Kriz', 'ToolTime',
       'Tree Of Logic', 'Triggered Newscast', 'TrutherTalk',
       'Turning Point USA', 'TylerPreston20', 'Tipster', 'TJump', 'TMM',
       "Tommy C's SFTP", 'Trae Crowder', "TYT's The Conversation",
       'TZMOfficialChannel', 'UK Column'], dtype=object)

In [1225]:
import re
def remove_emojis(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F700-\U0001F77F"  # alchemical symbols
                           u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                           u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                           u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                           u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                           u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                           u"\U00002702-\U000027B0"  # Dingbat symbols
                           u"\U000024C2-\U0001F251"  # Enclosed characters (note: this wide range also spans CJK and other non-emoji text)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Work on an explicit copy so the assignments below do not target a filtered slice
channel_subset_with_comments = channel_subset_with_comments.copy()
channel_subset_with_comments['comment_text'] = channel_subset_with_comments['comment_text'].apply(remove_emojis)

def remove_html_tags(text):
    # Non-greedy match removes each <...> tag individually
    html_tag_pattern = re.compile('<.*?>')
    return html_tag_pattern.sub('', text)

channel_subset_with_comments['comment_text'] = channel_subset_with_comments['comment_text'].apply(remove_html_tags)
channel_subset_with_comments = channel_subset_with_comments.reset_index(drop=True)
channel_subset_with_comments.tail(5)


Unnamed: 0,channel_id,channel_title,affiliation,video_id,video_title,video_description,comment_text,comment_id
77028,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",Their are lies damn lies and British politicia...,UgzziTnfC9kiB5qsyyF4AaABAg.9IAyyudOaoc9IDbTcSKMxR
77029,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",So many people with large family & friend netw...,UgzziTnfC9kiB5qsyyF4AaABAg.9IAyyudOaoc9ICpFxtI6ro
77030,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",wanna discuss clap trap Brian?,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB9FsByn5z
77031,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...","Which, when answered correctly leads nicely in...",UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB91oH6ug6
77032,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",WE will end there. Where? Who favoured Helleni...,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB-WsSjP9A


### *Now for the descriptions*

In [1226]:
# Print the first six unique video descriptions
for description in pd.unique(channel_subset_with_comments['video_description'])[:6]:
    print(description)

Paul, Tom, and Janice discuss a New York Times article about widows of COVID-19, noticing its determination to view the deaths of men through a gynocentric lens. Everything is about the suffering of women, even men's deaths. Then we acknowledge the remarkable work of Mark Perry on gender reality, looking at two charts about the gender gap in occupational fatalities, something that occurs year after year with almost zero public notice. 

https://www.nytimes.com/2020/12/31/us/covid-widows-deaths.html?smid=fb-nytimes&smtyp=cur&fbclid=IwAR0jevioPHKN5tgCZ5nL_GyRf6UZVgPuC2pi6_5OTQu7PnROhTvTEbhzrg4 

https://www.aei.org/carpe-diem/wednesday-evening-links-all-chart-edition/ 

https://youtu.be/icV-V73ZjRI


Tom's Subscribestar
https://www.subscribestar.com/red-pill-oasis

Tom's Patreon
https://patreon.com/menaregood

Tom's web site
https://menaregood.com

Twitter
@trgolden

Donate
https://menaregood.com/wordpress/donations

Toms Books info:  
https://menaregood.com/wordpress/tomsbooks


Swallow

In [1227]:
def remove_url_lines(text):
    lines = text.split('\n')  # Split the text into lines
    new_lines = [line.strip() for line in lines if not re.search(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', line)]
    cleaned = '\n'.join(new_lines)  # Join the lines back into a single string
    return re.sub('\n+', '\n', cleaned)+'\n'

channel_subset_with_comments['video_description'] = channel_subset_with_comments['video_description'].apply(remove_url_lines)
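To make the effect of `remove_url_lines` concrete, here is a self-contained sketch that applies the same logic to a short made-up description (the text and the `example.com` URL are placeholders). It drops any line containing a URL, strips the remaining lines, and collapses runs of blank lines.

```python
import re

# Same URL pattern as in the cell above, compiled once.
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

def remove_url_lines(text):
    # Keep only lines without a URL, stripped of surrounding whitespace.
    kept = [line.strip() for line in text.split('\n') if not URL_PATTERN.search(line)]
    # Collapse consecutive newlines and end with a single newline.
    return re.sub('\n+', '\n', '\n'.join(kept)) + '\n'

# Made-up description in the style of the real ones.
sample = "Great episode.\n\nPatreon\nhttps://example.com/support\n\nTwitter\n@someone"
cleaned = remove_url_lines(sample)
print(repr(cleaned))
```

Label lines such as "Patreon" survive even though the link they introduced is gone, which matches the orphaned labels visible in the cleaned descriptions printed in the next cell.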

In [1228]:
# Print the first six unique video descriptions
for description in pd.unique(channel_subset_with_comments['video_description'])[:6]:
    print(description)

Paul, Tom, and Janice discuss a New York Times article about widows of COVID-19, noticing its determination to view the deaths of men through a gynocentric lens. Everything is about the suffering of women, even men's deaths. Then we acknowledge the remarkable work of Mark Perry on gender reality, looking at two charts about the gender gap in occupational fatalities, something that occurs year after year with almost zero public notice.
Tom's Subscribestar
Tom's Patreon
Tom's web site
Twitter
@trgolden
Donate
Toms Books info:
Swallowed by a Snake: The Gift of the Masculine Side of Healing
The Way Men Heal
Helping Mothers be Closer to their Sons: Understanding the Unique World of Boys
Huge Men Are Good Mug
Men Are Good Hat

Tom, Paul, and Janice interview PhD Candidate Deborah Powney about her research on male victims of domestic violence. Deborah conducted an important survey last year and is now seeking participants for a second survey focusing on coercive control. Coercive control in i

In [1229]:
pd.unique(channel_subset_with_comments['channel_title'])

array(['Men Are Good!', 'TomWoodsTV', 'Tony Kriz', 'ToolTime',
       'Tree Of Logic', 'Triggered Newscast', 'TrutherTalk',
       'Turning Point USA', 'TylerPreston20', 'Tipster', 'TJump', 'TMM',
       "Tommy C's SFTP", 'Trae Crowder', "TYT's The Conversation",
       'TZMOfficialChannel', 'UK Column'], dtype=object)

In [1230]:
channel_subset_with_comments.loc[channel_subset_with_comments['comment_text'] == '', 'channel_title'].value_counts()


channel_title
Turning Point USA         228
Trae Crowder              114
TrutherTalk                92
Tipster                    58
TYT's The Conversation     25
Tommy C's SFTP             11
UK Column                  10
Triggered Newscast          8
TomWoodsTV                  7
TJump                       5
Men Are Good!               2
TylerPreston20              2
TMM                         2
Tree Of Logic               1
Name: count, dtype: int64

In [1231]:
channel_subset_with_comments = channel_subset_with_comments[channel_subset_with_comments.comment_text != '']
channel_subset_with_comments

Unnamed: 0,channel_id,channel_title,affiliation,video_id,video_title,video_description,comment_text,comment_id
0,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Men's generally poorer health from stress caus...,Ugxtc_P8-7EmyHXAaF14AaABAg
1,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...","Is anyone on gab? I can't find people there, p...",UgyhGIdaUug-1Ciyfbh4AaABAg
2,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",Life sucks doesn't it ladies. He just up and d...,UgwGR7A9eYedizYeCGZ4AaABAg
3,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...",I have seen figures that show less death in th...,UgyIq8h56bQaUI-O55l4AaABAg
4,UCAlVRoYjDbgLx7j6O2N8-2g,Men Are Good!,R,-T9vHzMdir4,Women Rally to Close the Gender Death Gap. NOT.,"Paul, Tom, and Janice discuss a New York Times...","""Waiting for the call for women to protect men...",UgzLSxmvmakux6GzJG14AaABAg
...,...,...,...,...,...,...,...,...
77028,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",Their are lies damn lies and British politicia...,UgzziTnfC9kiB5qsyyF4AaABAg.9IAyyudOaoc9IDbTcSKMxR
77029,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",So many people with large family & friend netw...,UgzziTnfC9kiB5qsyyF4AaABAg.9IAyyudOaoc9ICpFxtI6ro
77030,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...",wanna discuss clap trap Brian?,UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB9FsByn5z
77031,UCr_dQZ0irQdRiwVatkuHPcA,UK Column,L,2yoqlbn9Ts0,UK Column News - 6th January 2021,"Brian Gerrish, Mike Robinson, Alex Thomson and...","Which, when answered correctly leads nicely in...",UgwWdIXwRecT8RRGoZ14AaABAg.9IAyY08RTtV9IB91oH6ug6


In [1232]:
channel_subset_with_comments.loc[channel_subset_with_comments['comment_text'] == '', 'channel_title'].value_counts()


Series([], Name: count, dtype: int64)

Let's export this to work with it in another notebook.

In [1233]:
channel_subset_with_comments.to_csv("../data/cleaned/channel_subset_with_comments.csv", sep=',', index=False)
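One thing to keep in mind when loading this file in the next notebook: if any text field is still an empty string at export time, `pd.read_csv` will by default read it back as `NaN`. The sketch below demonstrates this on a made-up two-row frame using an in-memory buffer (the column names are placeholders), along with `keep_default_na=False` as one possible way to preserve empty strings.

```python
import io
import pandas as pd

# Toy frame with one empty text field; column names are placeholders.
toy = pd.DataFrame({'description': ['', 'Some text'], 'comment_text': ['hi', 'there']})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty field is parsed as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
string_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['description'].isna().tolist())
print(string_read['description'].tolist())
```

Note that `keep_default_na=False` applies to every column, so it is best suited to files like this one where all columns hold text.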