In [11]:
import pandas as pd
df_rt_info = pd.read_csv('zippedData/rt.movie_info.tsv.gz', sep='\t', compression='gzip')
df_rt_rev = pd.read_csv('zippedData/rt.reviews.tsv.gz', sep='\t', compression='gzip', encoding='iso-8859-1')

Looking at these two dataframes, I want to join the review data to the movie data. Because each movie has many reviews, a direct join would repeat every movie row once per review and produce a very large, unwieldy table. Instead, I want to build a per-movie summary of the reviews so I can add it as columns to the rt_info dataframe. After previewing the 'rating' column, I can see that the ratings use different scales, and some are numerical while others are not. Since cleaning this column would be messy, I am instead going to use the 'fresh' column to compute the percentage of positive reviews for each movie.
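To illustrate why a mixed-scale 'rating' column is awkward to use, here is a minimal sketch on made-up values. The sample strings, the "x/y" parsing rule, and the variable names are assumptions for illustration and are not taken from the actual reviews file.

```python
import pandas as pd

# Hypothetical sample ratings (not taken from rt.reviews.tsv.gz)
ratings = pd.Series(['3/5', '7/10', 'B+', '4.5/5', None, 'C'])

# Parse "x/y" style ratings into numerator and denominator; anything else becomes NaN
parts = ratings.str.extract(
    r'^(?P<num>\d+(?:\.\d+)?)/(?P<den>\d+(?:\.\d+)?)$'
).astype(float)

# Normalised score between 0 and 1 for the ratings that could be parsed
score = parts['num'] / parts['den']

# Share of ratings that could not be put on a common scale
share_unparsed = score.isna().mean()
print(score)
print(share_unparsed)
```

Even with a rule for fractions, letter grades and missing values are left over, which is the kind of mess that makes the binary 'fresh' column the simpler choice.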

In [12]:
# flag each review with 1 where the rating is 'fresh' and 0 otherwise
# (a vectorized comparison avoids the row-by-row chained assignment that triggers SettingWithCopyWarning)
df_rt_rev['fresh_ind'] = (df_rt_rev['fresh'] == 'fresh').astype(int)


In [13]:
# group the reviews by movie to get total reviews (count) and total positive reviews (sum)
df_grouped = df_rt_rev.groupby(['id'])['fresh_ind'].agg(['count', 'sum'])

# add a column with the share of positive reviews (a fraction between 0 and 1)
df_grouped['fresh_pct'] = df_grouped['sum'] / df_grouped['count']

# 'id' stays as the index here; the merge below matches on it by name
df_grouped.head(2)

id,count,sum,fresh_pct
3,163,103,0.631902
5,23,18,0.782609
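
Since fresh_pct is just the mean of a 0/1 indicator, the same summary can be produced in a single aggregation. The sketch below uses a made-up toy frame in place of df_rt_rev; the `is_fresh` column name is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for df_rt_rev (values are made up)
reviews = pd.DataFrame({
    'id': [3, 3, 3, 5, 5],
    'fresh': ['fresh', 'rotten', 'fresh', 'fresh', 'fresh'],
})

# Boolean flag per review; its mean per movie is the share of positive reviews
summary = (
    reviews.assign(is_fresh=reviews['fresh'].eq('fresh'))
           .groupby('id')['is_fresh']
           .agg(count='count', sum='sum', fresh_pct='mean')
)
print(summary)
```

Named aggregation keeps the column names identical to the ones used above (count, sum, fresh_pct), so the rest of the notebook would not need to change.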


The reviews dataframe covers fewer movies than the info dataframe, so I am going to do an inner join to keep only the movies that have reviews.

In [14]:
rt_all = pd.merge(df_rt_info, df_grouped, how='inner', on='id')
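
Before relying on the inner join, it can help to confirm that the keys are unique on both sides and to count how many movies are dropped. This is a minimal sketch on made-up toy frames; `info` and `grouped` are hypothetical stand-ins for df_rt_info and df_grouped, with 'id' as a regular column on both sides.

```python
import pandas as pd

# Toy stand-ins for df_rt_info and df_grouped (values are made up)
info = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'genre': ['Drama', 'Comedy', 'Drama', 'Horror'],
})
grouped = pd.DataFrame({'id': [1, 3], 'count': [10, 4], 'sum': [7, 1]})
grouped['fresh_pct'] = grouped['sum'] / grouped['count']

# validate='one_to_one' raises a MergeError if either side has duplicate ids
merged = pd.merge(info, grouped, how='inner', on='id', validate='one_to_one')

# an outer merge with indicator=True shows which movies have no reviews
check = pd.merge(info, grouped, how='outer', on='id', indicator=True)
n_without_reviews = (check['_merge'] == 'left_only').sum()
print(len(merged), n_without_reviews)
```

The `validate` and `indicator` parameters are part of `pd.merge`; everything else here is illustrative.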

In [15]:
# create a release year column by converting the theater dates to datetime and extracting the year
rt_all['theater_date'] = pd.to_datetime(rt_all['theater_date'])
rt_all['release_year'] = rt_all['theater_date'].dt.year

# fill missing years with 0 so the column can be cast to integer
rt_all['release_year'] = rt_all['release_year'].fillna(0)

# the years come through as floats because of the missing values, so cast them to integers
rt_all = rt_all.astype({'release_year': 'int64'})
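
An alternative to the 0 placeholder is pandas' nullable integer dtype, which keeps missing years as missing instead of turning them into a fake year. This is a sketch on made-up dates, assuming a pandas version with the 'Int64' extension dtype; it is not what the notebook above does.

```python
import pandas as pd

# Toy column with one missing date (values are made up)
dates = pd.Series(['2012-08-17', None, '1996-09-13'])
parsed = pd.to_datetime(dates, errors='coerce')

# .dt.year returns floats when NaT is present; 'Int64' stores integers plus <NA>
release_year = parsed.dt.year.astype('Int64')

# treat missing years as not matching the filter
recent = release_year.ge(2010).fillna(False)
print(release_year)
print(recent)
```

With this approach a later filter such as `release_year >= 2010` cannot accidentally pick up or average in year 0.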

In [16]:
rt_all.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,count,sum,fresh_pct,release_year
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,"Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,163,103,0.631902,2012
1,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,1996-09-13,"Apr 18, 2000",,,116 minutes,,23,18,0.782609,1996
2,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,1994-12-09,"Aug 27, 1997",,,128 minutes,,57,32,0.561404,1994
3,8,The year is 1942. As the Allies unite overseas...,PG,Drama|Kids and Family,Jay Russell,Gail Gilchriest,2000-03-03,"Jul 11, 2000",,,95 minutes,Warner Bros. Pictures,75,56,0.746667,2000
4,10,Some cast and crew from NBC's highly acclaimed...,PG-13,Comedy,Jake Kasdan,Mike White,2002-01-11,"Jun 18, 2002",$,41032915.0,82 minutes,Paramount Pictures,108,50,0.462963,2002


There are significantly fewer results in this data than in the other sources that I am using, so I am extending the date range back to 2010.

In [17]:
rt_filtered = rt_all[(rt_all['release_year'] >= 2010)]

In [18]:
#analyze reviews by genre, studio
rt_filtered.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,count,sum,fresh_pct,release_year
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,"Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,163,103,0.631902,2012
6,14,"""Love Ranch"" is a bittersweet love story that ...",R,Drama,Taylor Hackford,Mark Jacobson,2010-06-30,"Nov 9, 2010",$,134904.0,117 minutes,,42,6,0.142857,2010
12,23,A fictional film set in the alluring world of ...,R,Drama,,,2013-12-20,"Mar 18, 2014",$,99165609.0,129 minutes,Sony Pictures,233,213,0.914163,2013
14,25,"From ancient Japan's most enduring tale, the e...",PG-13,Action and Adventure|Drama|Science Fiction and...,Carl Erik Rinsch,Chris Morgan|Hossein Amini,2013-12-25,"Apr 1, 2014",$,20518224.0,127 minutes,Universal Pictures,37,4,0.108108,2013
31,54,Journalist Jep Gambardella (the dazzling Toni ...,NR,Comedy|Drama,Paolo Sorrentino,Paolo Sorrentino|Umberto Contarello,2013-11-15,"Mar 25, 2014",,,142 minutes,Janus Films,106,95,0.896226,2013


In [22]:
# work on an explicit copy so the new columns are not assigned to a slice of rt_all
# (this avoids SettingWithCopyWarning)
rt_filtered = rt_filtered.copy()

# split the pipe-delimited genre string into up to three separate columns
genres = rt_filtered['genre'].str.split('|', n=2, expand=True)
rt_filtered['genre1'] = genres[0]
rt_filtered['genre2'] = genres[1]
rt_filtered['genre3'] = genres[2]
rt_filtered.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio,count,sum,fresh_pct,release_year,genre1,genre2,genre3
0,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,2012-08-17,"Jan 1, 2013",$,600000.0,108 minutes,Entertainment One,163,103,0.631902,2012,Drama,Science Fiction and Fantasy,
6,14,"""Love Ranch"" is a bittersweet love story that ...",R,Drama,Taylor Hackford,Mark Jacobson,2010-06-30,"Nov 9, 2010",$,134904.0,117 minutes,,42,6,0.142857,2010,Drama,,
12,23,A fictional film set in the alluring world of ...,R,Drama,,,2013-12-20,"Mar 18, 2014",$,99165609.0,129 minutes,Sony Pictures,233,213,0.914163,2013,Drama,,
14,25,"From ancient Japan's most enduring tale, the e...",PG-13,Action and Adventure|Drama|Science Fiction and...,Carl Erik Rinsch,Chris Morgan|Hossein Amini,2013-12-25,"Apr 1, 2014",$,20518224.0,127 minutes,Universal Pictures,37,4,0.108108,2013,Action and Adventure,Drama,Science Fiction and Fantasy
31,54,Journalist Jep Gambardella (the dazzling Toni ...,NR,Comedy|Drama,Paolo Sorrentino,Paolo Sorrentino|Umberto Contarello,2013-11-15,"Mar 25, 2014",,,142 minutes,Janus Films,106,95,0.896226,2013,Comedy,Drama,


In [None]:
#add loop for genre analysis
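
One possible way to approach the genre analysis noted above without an explicit loop is to reshape the data to one row per movie and genre, then aggregate. This is a minimal sketch on a made-up toy frame standing in for rt_filtered; the `exploded` and `by_genre` names are assumptions, and it requires a pandas version that provides `DataFrame.explode`.

```python
import pandas as pd

# Toy stand-in for rt_filtered (values are made up)
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'genre': ['Drama|Comedy', 'Drama', 'Comedy|Horror'],
    'fresh_pct': [0.8, 0.4, 0.6],
})

# one row per (movie, genre) pair
exploded = toy.assign(genre=toy['genre'].str.split('|')).explode('genre')

# average positive-review share and number of movies per genre
by_genre = (
    exploded.groupby('genre')['fresh_pct']
            .agg(['mean', 'count'])
            .sort_values('mean', ascending=False)
)
print(by_genre)
```

Working from the original 'genre' column means a movie with more than three genres is still counted under each of them, which the genre1/genre2/genre3 columns would not guarantee.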