# Question 1
### PART 1: (1 point)

Load the MovieData.csv dataset into a pandas DataFrame as described in this week's lesson, and use it to find the following values:

a. What is the median profit of movies with budgets of over $50M?

b. How many movies were released by each film distributor? Output the results to a csv file.

### PART 2: (6 points)

c. What are the mean and median movie profits by decade? Which decade was the most profitable? (Hint: Answering this question requires several steps: grouping the movies by decade, computing the mean and median profits for each decade, and combining the results back together.)

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
%matplotlib inline 

In [2]:
def make_date(date_str):
    '''
    Turn a MM/DD/YY string into a datetime object
    '''
    m, d, y = date_str.split("/")
    m = int(m)
    d = int(d)
    y = int(y)
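    # Two-digit years above 13 are read as 19YY, the rest as 20YY
    # (the newest film in the dataset is from 2013).
    if y > 13: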
        y += 1900
    else:
        y += 2000
    return dt.datetime(y, m, d)

In [3]:
movies = pd.read_csv("MovieData.csv", sep='\t', na_values=["Unknown", "Unkno"], 
                         parse_dates=[0], date_parser=make_date)
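
In recent pandas releases (2.0 and later) the `date_parser` argument of `read_csv` is deprecated. A minimal sketch of one alternative, assuming the same MM/DD/YY layout and the same two-digit-year pivot as `make_date`, is to read the date column as text and convert it after loading. The inline sample rows, the movie title "Example Silent Film", and the name `parse_release_date` are made up for illustration.

```python
import io
import datetime as dt

import pandas as pd

# Hypothetical two-row, tab-separated sample laid out like MovieData.csv.
sample = (
    "Release_Date\tMovie\tDistributor\tBudget\tUS Gross\tWorldwide Gross\n"
    "3/9/12\tJohn Carter\tUnknown\t300000000\t66439100\t254439100\n"
    "2/8/15\tExample Silent Film\tUnknown\t110000\t10000000\t11000000\n"
)

def parse_release_date(date_str, pivot=13):
    """Turn an MM/DD/YY string into a datetime; years above `pivot` map to 19YY."""
    m, d, y = (int(part) for part in date_str.split("/"))
    y += 1900 if y > pivot else 2000
    return dt.datetime(y, m, d)

sample_movies = pd.read_csv(io.StringIO(sample), sep="\t",
                            na_values=["Unknown", "Unkno"])
# Convert the text column after loading instead of passing date_parser.
sample_movies["Release_Date"] = pd.to_datetime(
    sample_movies["Release_Date"].apply(parse_release_date)
)
print(sample_movies[["Release_Date", "Movie"]])
```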

In [4]:
movies.head()

Unnamed: 0,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross
0,2012-03-09,John Carter,,300000000,66439100.0,254439100.0
1,2007-05-25,Pirates of the Caribbean: At World's End,Buena Vista,300000000,309420425.0,960996492.0
2,2013-12-13,The Hobbit: There and Back Again,New Line,270000000,,
3,2012-12-14,The Hobbit: An Unexpected Journey,New Line,270000000,,
4,2010-11-24,Tangled,Buena Vista,260000000,200821936.0,586581936.0


In [5]:
# Replace missing values with zeros
movies.fillna(0, inplace=True)
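
Filling every column with 0 also places the integer 0 in text columns such as Distributor, which is why a `0` distributor shows up in the outputs that follow. A hedged alternative, shown on a small made-up frame and not on the real data, passes a per-column mapping to `fillna` so that numeric and text columns receive different placeholders. The placeholder string "Unknown" is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Movie": ["A", "B", "C"],
    "Distributor": [np.nan, "New Line", "Buena Vista"],
    "US Gross": [66439100.0, np.nan, 200821936.0],
    "Worldwide Gross": [254439100.0, np.nan, 586581936.0],
})

# Per-column fill values: zeros for the numeric columns, a label for text.
fill_values = {"Distributor": "Unknown", "US Gross": 0, "Worldwide Gross": 0}
filled = toy.fillna(value=fill_values)
print(filled)
```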

In [6]:
movies.head()

Unnamed: 0,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross
0,2012-03-09,John Carter,0,300000000,66439100.0,254439100.0
1,2007-05-25,Pirates of the Caribbean: At World's End,Buena Vista,300000000,309420425.0,960996492.0
2,2013-12-13,The Hobbit: There and Back Again,New Line,270000000,0.0,0.0
3,2012-12-14,The Hobbit: An Unexpected Journey,New Line,270000000,0.0,0.0
4,2010-11-24,Tangled,Buena Vista,260000000,200821936.0,586581936.0


In [7]:
print("The date of the oldest movie in the dataset is %r." % min(movies["Release_Date"]))
print("The date of the newest movie in the dataset is %r." % max(movies["Release_Date"]))

The date of the oldest movie in the dataset is Timestamp('1915-02-08 00:00:00').
The date of the newest movie in the dataset is Timestamp('2013-12-13 00:00:00').


In [8]:
# Fill in Worldwide Gross with US Gross wherever it is zero.
# Assigning through .loc avoids chained indexing and the SettingWithCopyWarning it triggers.
movies.loc[movies["Worldwide Gross"] == 0, "Worldwide Gross"] = movies["US Gross"]
#movies["US Gross"][movies["US Gross"]==0] = movies["Worldwide Gross"]


In [9]:
movies["Profits"] = movies["Worldwide Gross"] - movies["Budget"]
movies.head()

Unnamed: 0,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross,Profits
0,2012-03-09,John Carter,0,300000000,66439100.0,254439100.0,-45560900.0
1,2007-05-25,Pirates of the Caribbean: At World's End,Buena Vista,300000000,309420425.0,960996492.0,660996492.0
2,2013-12-13,The Hobbit: There and Back Again,New Line,270000000,0.0,0.0,-270000000.0
3,2012-12-14,The Hobbit: An Unexpected Journey,New Line,270000000,0.0,0.0,-270000000.0
4,2010-11-24,Tangled,Buena Vista,260000000,200821936.0,586581936.0,326581936.0


In [10]:
#movies[(movies.Budget > 50000000) & (movies["US Gross"].notnull())]
bigger_budget = movies[movies.Budget > 50000000]

In [11]:
#movies[(movies.Budget > 50000000)].median()
bigger_budget.Profits.median()

89246220.0

### Answer to 1 a.
#### a. What is the median profit of movies with budgets of over $50M?

In [12]:
print("The median profit of movies with budgets of over $50M is %d." %
     bigger_budget.Profits.median())

The median profit of movies with budgets of over $50M is 89246220.
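
Because missing grosses were replaced with zeros earlier, films without reported grosses enter this median with a profit equal to minus their budget. The following sketch, on made-up numbers, shows how the median can shift when such rows are left as missing instead. It illustrates the effect only and is not a recomputation of the figure above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Budget": [60e6, 80e6, 100e6, 270e6, 20e6],
    "Worldwide Gross": [150e6, 90e6, 400e6, np.nan, 50e6],
})

# Keep only the films with budgets above $50M.
big = toy[toy["Budget"] > 50e6]

# Variant 1: missing gross treated as zero, as in the notebook.
profit_zero_filled = big["Worldwide Gross"].fillna(0) - big["Budget"]
# Variant 2: missing gross left as NaN; median() skips NaN by default.
profit_with_nan = big["Worldwide Gross"] - big["Budget"]

print(profit_zero_filled.median())
print(profit_with_nan.median())
```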


In [13]:
distributors = movies.groupby("Distributor").aggregate(len)
distributor_count = distributors["Movie"]
distributor_count

Distributor
0                                  659
20th Century Fox                   230
3D Entertainment                     1
8 X Entertainment                    1
ART                                  1
Access                               1
Alliance                             4
American International Pictures      1
Anchor Bay                           4
Apparition                           4
Artisan                             23
Artistic License                     1
Atlantic                             1
Attitude Films                       1
Avatar                               1
Avco Embassy                         5
Barking Cow                          1
Big Pictures                         1
Bigger Picture                       1
Black Diamond Pictures               1
Buena Vista                        227
CBS Films                            3
CFP                                  1
CHRIST                               1
Cannon                               4
Cinema Servic

### Answer to Question 1 b.
#### b. How many movies were released by each film distributor? 
#### NOTE: The results are written to a csv file further below.

In [14]:
print("The number of movies released by each film distributor is shown below:")
print(distributor_count)

The number of movies released by each film distributor is shown below:
Distributor
0                                  659
20th Century Fox                   230
3D Entertainment                     1
8 X Entertainment                    1
ART                                  1
Access                               1
Alliance                             4
American International Pictures      1
Anchor Bay                           4
Apparition                           4
Artisan                             23
Artistic License                     1
Atlantic                             1
Attitude Films                       1
Avatar                               1
Avco Embassy                         5
Barking Cow                          1
Big Pictures                         1
Bigger Picture                       1
Black Diamond Pictures               1
Buena Vista                        227
CBS Films                            3
CFP                                  1
CHRIST              

In [15]:
type(distributor_count)

pandas.core.series.Series

In [16]:
data = pd.DataFrame(distributor_count)
data.head()

Distributor,Movie
0,659
20th Century Fox,230
3D Entertainment,1
8 X Entertainment,1
ART,1


In [17]:
data.index

Index([                                0,                '20th Century Fox',
                      '3D Entertainment',               '8 X Entertainment',
                                   'ART',                          'Access',
                              'Alliance', 'American International Pictures',
                            'Anchor Bay',                      'Apparition',
       ...
                         'Weinstein Co.',             'Weinstein/Dimension',
                             'Weintraub',                      'WellSpring',
                            'Wellspring',                         'WinStar',
                               'Winstar',                        'Yash Raj',
                             'Zeitgeist',                            'Zion'],
      dtype='object', name='Distributor', length=209)

In [18]:
data.columns = ["Counts"]
data

Distributor,Counts
0,659
20th Century Fox,230
3D Entertainment,1
8 X Entertainment,1
ART,1
Access,1
Alliance,4
American International Pictures,1
Anchor Bay,4
Apparition,4


### Outputting the results to a csv file.

In [19]:
data.to_csv('hw3_1b.csv', sep=',') # Outputs the results to a csv file.
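
As a cross-check on the group-and-count approach, `Series.value_counts` gives the same kind of tally in one call. The sketch below uses a made-up list of distributors and writes to an in-memory buffer instead of `hw3_1b.csv`; the column name `Counts` matches the one chosen above.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Movie": ["A", "B", "C", "D"],
    "Distributor": ["Sony", "New Line", "Sony", "Buena Vista"],
})

# Count movies per distributor, sorted by name like the groupby output.
counts = toy["Distributor"].value_counts().sort_index()
counts = counts.rename("Counts").rename_axis("Distributor")

# Write to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
counts.to_csv(buffer, header=True)
print(buffer.getvalue())
```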

#### What are the mean and median movie profits by decade? Which decade was the most profitable? (Hint: Answering this question requires several steps: grouping the movies by decade, computing the mean and median profits for each decade, and combining the results back together.)

In [20]:
movies.Release_Date[19].year

2005

In [21]:
movies["Year"] = movies.Release_Date.apply(lambda x: x.year)
movies.head()

Unnamed: 0,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross,Profits,Year
0,2012-03-09,John Carter,0,300000000,66439100.0,254439100.0,-45560900.0,2012
1,2007-05-25,Pirates of the Caribbean: At World's End,Buena Vista,300000000,309420425.0,960996492.0,660996492.0,2007
2,2013-12-13,The Hobbit: There and Back Again,New Line,270000000,0.0,0.0,-270000000.0,2013
3,2012-12-14,The Hobbit: An Unexpected Journey,New Line,270000000,0.0,0.0,-270000000.0,2012
4,2010-11-24,Tangled,Buena Vista,260000000,200821936.0,586581936.0,326581936.0,2010


In [22]:
print(movies.Year.min())
print(movies.Year.max())

1915
2013


In [23]:
by_year = movies.groupby("Year")
print(by_year.groups.keys())

dict_keys([1915, 1916, 1920, 1925, 1927, 1929, 1930, 1931, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013])


In [24]:
by_year.sum().head()

Year,Budget,US Gross,Worldwide Gross,Profits
1915,110000,10000000.0,11000000.0,10890000.0
1916,585907,8000000.0,8000000.0,7414093.0
1920,100000,3000000.0,3000000.0,2900000.0
1925,4145000,20000000.0,31000000.0,26855000.0
1927,2000000,0.0,0.0,-2000000.0


In [25]:
# Grouping the movies by decade
movies["Decade"] = movies.Year.apply(lambda x: x // 10 * 10)
movies["Decade"].unique()

array([2010, 2000, 1990, 1980, 1970, 1960, 1950, 1940, 1930, 1920, 1910],
      dtype=int64)
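
The decade label relies on integer floor division: dividing the year by 10 discards the last digit, and multiplying by 10 restores the scale. A small sketch with made-up years, applying the operation to the whole column at once instead of going through `apply`:

```python
import pandas as pd

years = pd.Series([1915, 1929, 1970, 1979, 2013], name="Year")

# Floor-divide by 10, then scale back up: 1979 -> 197 -> 1970.
decades = years // 10 * 10
print(decades.tolist())
```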

In [26]:
by_decade = movies.groupby("Decade")
by_decade.head()

Unnamed: 0,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross,Profits,Year,Decade
0,2012-03-09,John Carter,0,300000000,66439100.0,254439100.0,-45560900.0,2012,2010
1,2007-05-25,Pirates of the Caribbean: At World's End,Buena Vista,300000000,309420425.0,960996500.0,660996500.0,2007,2000
2,2013-12-13,The Hobbit: There and Back Again,New Line,270000000,0.0,0.0,-270000000.0,2013,2010
3,2012-12-14,The Hobbit: An Unexpected Journey,New Line,270000000,0.0,0.0,-270000000.0,2012,2010
4,2010-11-24,Tangled,Buena Vista,260000000,200821936.0,586581900.0,326581900.0,2010,2010
5,2007-05-04,Spider-Man 3,Sony,258000000,336530303.0,890875300.0,632875300.0,2007,2000
6,2009-07-15,Harry Potter and the Half-Blood Prince,Warner Bros.,250000000,301959197.0,934416500.0,684416500.0,2009,2000
7,2011-05-20,Pirates of the Caribbean: On Stranger Tides,Buena Vista,250000000,241063875.0,1043664000.0,793663900.0,2011,2010
9,2009-12-18,Avatar,20th Century Fox,237000000,760507625.0,2783919000.0,2546919000.0,2009,2000
10,2006-06-28,Superman Returns,Warner Bros.,232000000,200120000.0,390874000.0,158874000.0,2006,2000


In [27]:
# Computing the mean and median profits for each decade
by_decade_mean = by_decade.Profits.mean()
by_decade_median = by_decade.Profits.median()
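
The hint's last step, combining the results back together, can be sketched by requesting both statistics in a single `agg` call, which returns one table indexed by decade. The data here are made up, and `idxmax` is used to pick the decade with the highest mean; whether the mean or the median is the better yardstick for "most profitable" is a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({
    "Decade": [1970, 1970, 1980, 1980, 1980],
    "Profits": [10.0, 30.0, 5.0, 15.0, 100.0],
})

# One row per decade, one column per statistic.
summary = toy.groupby("Decade")["Profits"].agg(["mean", "median"])
print(summary)

# Decade label with the largest mean profit per movie.
best_decade = summary["mean"].idxmax()
print(best_decade)
```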

### Answers to Question 1 c.
#### c. What are the mean and median movie profits by decade? Which decade was the most profitable? (Hint: Answering this question requires several steps: grouping the movies by decade, computing the mean and median profits for each decade, and combining the results back together.)

In [28]:
print("The mean movie profits by decade were:")
by_decade_mean

The mean movie profits by decade were:


Decade
1910    6.101364e+06
1920    6.346800e+06
1930    3.892876e+07
1940    1.025301e+07
1950    1.816625e+07
1960    2.845890e+07
1970    6.358547e+07
1980    5.114162e+07
1990    5.751548e+07
2000    5.318013e+07
2010    6.331232e+07
Name: Profits, dtype: float64

In [29]:
print("The median movie profits by decade were:")
by_decade_median

The median movie profits by decade were:


Decade
1910     7800000.0
1920     3979000.0
1930     2265500.0
1940     6012000.0
1950     8690000.0
1960    10564923.0
1970    19533200.0
1980    16168359.0
1990     9133087.0
2000     8762690.0
2010     8626300.0
Name: Profits, dtype: float64

In [30]:
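# Look up the decade with the highest mean profit instead of hard-coding it.
best_decade = by_decade_mean.idxmax()
print("The most profitable decade by mean profit per movie was the %ds, with a mean profit of $%.0f." %
      (best_decade, by_decade_mean.max()))

The most profitable decade by mean profit per movie was the 1970s, with a mean profit of $63585471.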


# Question 2
### PART 1: (1 point)

Load the earthquake data in QuakeData.csv into a DataFrame, and use it to answer the following questions:

a. What is the median earthquake magnitude?

b. What is the correlation between magnitude and depth?

In [31]:
#earthquakes = pd.read_csv("QuakeData.csv", sep=',')
earthquakes = pd.read_csv("QuakeData.csv", parse_dates=[0])

In [32]:
earthquakes.head()

Unnamed: 0,DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID,Version
0,2012-01-01 00:30:08.770,12.008,143.487,35.0,5.1,mb,178,45,,1.2,pde,pde20120101003008770_35,1363392487731
1,2012-01-01 00:43:42.770,12.014,143.536,35.0,4.4,mb,29,121,,0.98,pde,pde20120101004342770_35,1363392488431
2,2012-01-01 00:50:08.040,-11.366,166.218,67.5,5.3,mb,143,43,,0.82,pde,pde20120101005008040_67,1363392488479
3,2012-01-01 01:22:07.660,-6.747,130.008,145.0,4.2,mb,14,112,,1.16,pde,pde20120101012207660_145,1363392488594
4,2012-01-01 02:35:21.110,23.472,91.834,27.8,4.6,mb,74,77,,0.65,pde,pde20120101023521110_27,1363392488611


In [33]:
earthquakes.dtypes

DateTime      datetime64[ns]
Latitude             float64
Longitude            float64
Depth                float64
Magnitude            float64
MagType               object
NbStations             int64
Gap                    int64
Distance             float64
RMS                  float64
Source                object
EventID               object
Version                int64
dtype: object

In [34]:
earthquakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 13 columns):
DateTime      12684 non-null datetime64[ns]
Latitude      12684 non-null float64
Longitude     12684 non-null float64
Depth         12676 non-null float64
Magnitude     12684 non-null float64
MagType       12684 non-null object
NbStations    12684 non-null int64
Gap           12684 non-null int64
Distance      1 non-null float64
RMS           10544 non-null float64
Source        12684 non-null object
EventID       12684 non-null object
Version       12684 non-null int64
dtypes: datetime64[ns](1), float64(6), int64(3), object(3)
memory usage: 1.3+ MB


In [35]:
earthquakes.Magnitude.describe()

count    12684.000000
mean         4.558483
std          0.418082
min          4.000000
25%          4.300000
50%          4.500000
75%          4.800000
max          8.600000
Name: Magnitude, dtype: float64

### Answer to 2 a.
### a. What is the median earthquake magnitude?

In [36]:
print("The median earthquake magnitude for the dataset is %f." % earthquakes.Magnitude.median())

The median earthquake magnitude for the dataset is 4.500000.


### Answer to 2 b.
### b. What is the correlation between magnitude and depth?

In [37]:
earthquakes_subset = earthquakes[['Magnitude','Depth']]

In [38]:
earthquakes_subset.corr()

Unnamed: 0,Magnitude,Depth
Magnitude,1.0,0.029175
Depth,0.029175,1.0


In [39]:
earthquakes.Depth.corr(earthquakes.Magnitude)

0.02917515915997664

In [40]:
print("The correlation between magnitude and depth is %f." %
      earthquakes.Depth.corr(earthquakes.Magnitude))


The correlation between magnitude and depth is 0.029175.
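
The value reported by `corr` is the Pearson coefficient, and pandas leaves out rows where either column is missing before computing it (here, the 8 rows without a Depth). A minimal sketch on made-up values, comparing the pandas result with a direct NumPy computation:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Magnitude": [4.2, 4.5, 5.1, 4.8, 6.0],
    "Depth": [10.0, 35.0, np.nan, 145.0, 27.8],
})

# pandas: Pearson correlation over rows where both values are present.
r_pandas = toy["Depth"].corr(toy["Magnitude"])

# NumPy: drop incomplete rows explicitly, then take the off-diagonal entry.
complete = toy.dropna()
r_numpy = np.corrcoef(complete["Depth"], complete["Magnitude"])[0, 1]

print(r_pandas, r_numpy)
```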


### PART 2: (7 points)

c. What fraction (not count) of earthquakes happen each month, across all years (i.e. all earthquakes occurring in January as a proportion of the grand total, all earthquakes in February as a proportion of the grand total, etc.)?

d. Is there a correlation between the number of movies released monthly (i.e. Jan-1990, Feb-1990, ...) and the number of earthquakes in that month?

In [41]:
earthquakes.head()

Unnamed: 0,DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID,Version
0,2012-01-01 00:30:08.770,12.008,143.487,35.0,5.1,mb,178,45,,1.2,pde,pde20120101003008770_35,1363392487731
1,2012-01-01 00:43:42.770,12.014,143.536,35.0,4.4,mb,29,121,,0.98,pde,pde20120101004342770_35,1363392488431
2,2012-01-01 00:50:08.040,-11.366,166.218,67.5,5.3,mb,143,43,,0.82,pde,pde20120101005008040_67,1363392488479
3,2012-01-01 01:22:07.660,-6.747,130.008,145.0,4.2,mb,14,112,,1.16,pde,pde20120101012207660_145,1363392488594
4,2012-01-01 02:35:21.110,23.472,91.834,27.8,4.6,mb,74,77,,0.65,pde,pde20120101023521110_27,1363392488611


In [42]:
earthquakes["Month"] = earthquakes.DateTime.apply(lambda x: x.month)
earthquakes.columns

Index(['DateTime', 'Latitude', 'Longitude', 'Depth', 'Magnitude', 'MagType',
       'NbStations', 'Gap', 'Distance', 'RMS', 'Source', 'EventID', 'Version',
       'Month'],
      dtype='object')

In [43]:
#by_month = earthquakes.groupby("Month")
#by_month.head()

In [44]:
quakes = earthquakes[["Month", "DateTime"]]

In [45]:
by_month = quakes.groupby("Month")
by_month.head()

Unnamed: 0,Month,DateTime
0,1,2012-01-01 00:30:08.770
1,1,2012-01-01 00:43:42.770
2,1,2012-01-01 00:50:08.040
3,1,2012-01-01 01:22:07.660
4,1,2012-01-01 02:35:21.110
1005,2,2012-02-01 01:29:24.860
1006,2,2012-02-01 01:54:58.660
1007,2,2012-02-01 02:43:19.000
1008,2,2012-02-01 04:26:14.450
1009,2,2012-02-01 04:30:47.110


In [46]:
total_quakes_per_month = by_month.aggregate(len)
total_quakes_per_month.columns = ["Number of Earthquakes that Month"]
total_quakes_per_month

Month,Number of Earthquakes that Month
1,1024
2,1081
3,1145
4,1393
5,1058
6,900
7,882
8,1022
9,1132
10,1051


### Answer to Question 2 c.
#### c. What fraction (not count) of earthquakes happen each month, across all years (i.e. all earthquakes occurring in January as a proportion of the grand total, all earthquakes in February as a proportion of the grand total, etc.)?

In [47]:
fraction = total_quakes_per_month / total_quakes_per_month["Number of Earthquakes that Month"].sum()
fraction.columns = ["Fraction of Total Earthquakes"]
print("The fractions of earthquakes that happen each month as a proportion " +
      "of the grand total of all earthquakes are shown below: ")
fraction

The fractions of earthquakes that happen each month as a proportion of the grand total of all earthquakes are shown below: 


Month,Fraction of Total Earthquakes
1,0.080732
2,0.085225
3,0.090271
4,0.109823
5,0.083412
6,0.070956
7,0.069536
8,0.080574
9,0.089246
10,0.08286
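
The same kind of proportion can be obtained directly with `value_counts(normalize=True)`, which divides each count by the total. A sketch on a handful of made-up timestamps:

```python
import pandas as pd

stamps = pd.to_datetime([
    "2012-01-01 00:30:08", "2012-01-15 10:00:00",
    "2012-02-01 01:29:24", "2012-03-05 12:00:00",
])
toy = pd.DataFrame({"DateTime": stamps})

# Fraction of events per calendar month, ordered by month number.
fractions = toy["DateTime"].dt.month.value_counts(normalize=True).sort_index()
print(fractions)
```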


In [48]:
movies["Month"] = movies.Release_Date.apply(lambda x: x.month)
movies.head()

Unnamed: 0,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross,Profits,Year,Decade,Month
0,2012-03-09,John Carter,0,300000000,66439100.0,254439100.0,-45560900.0,2012,2010,3
1,2007-05-25,Pirates of the Caribbean: At World's End,Buena Vista,300000000,309420425.0,960996492.0,660996492.0,2007,2000,5
2,2013-12-13,The Hobbit: There and Back Again,New Line,270000000,0.0,0.0,-270000000.0,2013,2010,12
3,2012-12-14,The Hobbit: An Unexpected Journey,New Line,270000000,0.0,0.0,-270000000.0,2012,2010,12
4,2010-11-24,Tangled,Buena Vista,260000000,200821936.0,586581936.0,326581936.0,2010,2010,11


In [49]:
movies_by_month = movies.groupby("Month")
movies_by_month

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000000008F39DA0>

In [50]:
total_movies_per_month = movies_by_month.aggregate(len)
total_movies_per_month

Month,Release_Date,Movie,Distributor,Budget,US Gross,Worldwide Gross,Profits,Year,Decade
1,256,256,256,256,256.0,256.0,256.0,256,256
2,231,231,231,231,231.0,231.0,231.0,231,231
3,282,282,282,282,282.0,282.0,282.0,282,282
4,288,288,288,288,288.0,288.0,288.0,288,288
5,255,255,255,255,255.0,255.0,255.0,255,255
6,305,305,305,305,305.0,305.0,305.0,305,305
7,282,282,282,282,282.0,282.0,282.0,282,282
8,321,321,321,321,321.0,321.0,321.0,321,321
9,307,307,307,307,307.0,307.0,307.0,307,307
10,367,367,367,367,367.0,367.0,367.0,367,367


In [51]:
total_movies_per_month = total_movies_per_month[["Movie"]]
total_movies_per_month.columns = ["Movies Released Monthly"]

In [52]:
total_movies_per_month

Month,Movies Released Monthly
1,256
2,231
3,282
4,288
5,255
6,305
7,282
8,321
9,307
10,367


In [53]:
monthly_data = total_movies_per_month.merge(total_quakes_per_month, how="right",
                                   left_index=True, right_on="Month")
monthly_data.columns

Index(['Movies Released Monthly', 'Number of Earthquakes that Month'], dtype='object')

In [54]:
monthly_data

Month,Movies Released Monthly,Number of Earthquakes that Month
1,256,1024
2,231,1081
3,282,1145
4,288,1393
5,255,1058
6,305,900
7,282,882
8,321,1022
9,307,1132
10,367,1051
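
Since both count tables share the month number as their index, `DataFrame.join` is a compact alternative to `merge` here. A sketch with made-up counts, using an inner join so that only months present in both tables are kept:

```python
import pandas as pd

movie_counts = pd.DataFrame(
    {"Movies Released Monthly": [10, 20, 30]},
    index=pd.Index([1, 2, 3], name="Month"),
)
quake_counts = pd.DataFrame(
    {"Number of Earthquakes that Month": [100, 200]},
    index=pd.Index([1, 2], name="Month"),
)

# Align the two tables on their shared Month index.
joined = movie_counts.join(quake_counts, how="inner")
print(joined)
```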


### Answer to Question 2 d.
#### d. Is there a correlation between the number of movies released monthly (i.e. Jan-1990, Feb-1990, ...) and the number of earthquakes in that month?

In [55]:
monthly_data.corr()

Unnamed: 0,Movies Released Monthly,Number of Earthquakes that Month
Movies Released Monthly,1.0,-0.156923
Number of Earthquakes that Month,-0.156923,1.0
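
The coefficient above compares calendar-month totals (all Januaries pooled, all Februaries pooled, and so on), and at about -0.16 it indicates only a weak negative relationship. The question's wording (Jan-1990, Feb-1990, ...) can also be read as asking about individual year-month periods. A hedged sketch of that reading, on made-up dates, counts events per monthly period with `dt.to_period("M")` and correlates the counts over the periods both datasets share:

```python
import pandas as pd

movie_dates = pd.to_datetime([
    "2012-01-06", "2012-01-20", "2012-02-10",
    "2012-03-09", "2012-03-16", "2012-03-23",
])
quake_dates = pd.to_datetime([
    "2012-01-01", "2012-02-01", "2012-02-14",
    "2012-03-02", "2012-03-11", "2012-03-30", "2012-04-01",
])

# Count events per year-month period (e.g. 2012-01), not per calendar month.
movies_per_period = pd.Series(movie_dates).dt.to_period("M").value_counts()
quakes_per_period = pd.Series(quake_dates).dt.to_period("M").value_counts()

# The inner join keeps only the periods present in both datasets.
per_period = pd.concat(
    {"movies": movies_per_period, "quakes": quakes_per_period},
    axis=1, join="inner",
).sort_index()
period_corr = per_period["movies"].corr(per_period["quakes"])
print(per_period)
print(period_corr)
```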
