---
# Beginning Data Analysis
---

## Developing a data analysis routine

In [1]:
import numpy as np
import pandas as pd

1. Read in the dataset, and view a sample of rows with the `.sample` method:

In [None]:
college = pd.read_csv("college.csv")
college.sample(n=15, random_state=42)

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3649,Career Point College,San Antonio,TX,0.0,0.0,0.0,0,,,0.0,529.0,0.3251,0.3119,0.3629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.9172,0.9172,0.697,20700,14977
1600,Ner Israel Rabbinical College,Baltimore,MD,0.0,1.0,0.0,1,,,0.0,305.0,0.9279,0.0,0.0,0.0,0.0,0.0,0.0,0.0721,0.0,0.0,1,0.2382,0.0,0.0882,PrivacySuppressed,PrivacySuppressed
6742,Reflections Academy of Beauty,Decatur,IL,0.0,0.0,0.0,0,,,0.0,5.0,0.8,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.8621,0.5862,0.3333,,PrivacySuppressed
1467,Capital Area Technical College,Baton Rouge,LA,0.0,0.0,0.0,0,,,0.0,1687.0,0.2833,0.4908,0.0148,0.0047,0.0071,0.0006,0.0053,0.0006,0.1926,0.5673,1,0.2502,0.0,0.4815,26400,PrivacySuppressed
4053,West Virginia University Institute of Technology,Montgomery,WV,0.0,0.0,0.0,0,465.0,500.0,0.0,1115.0,0.7462,0.0691,0.0457,0.0126,0.0045,0.0,0.0287,0.0762,0.017,0.1229,1,0.4092,0.5237,0.2381,43400,23969
4087,Mid-State Technical College,Wisconsin Rapids,WI,0.0,0.0,0.0,0,,,0.0,2531.0,0.904,0.0103,0.0162,0.0253,0.0067,0.0008,0.019,0.0,0.0178,0.6045,1,0.4657,0.4461,0.4819,32000,8025
7495,Strayer University-Huntsville Campus,Huntsville,AL,,,,1,,,,,,,,,,,,,,,1,,,,49200,36173.5
4587,National Aviation Academy of Tampa Bay,Clearwater,FL,0.0,0.0,0.0,0,,,0.0,605.0,0.562,0.1223,0.2364,0.0248,0.0083,0.005,0.0198,0.0198,0.0017,0.0,1,0.6983,0.7296,0.5376,45000,22778
251,University of California-Santa Cruz,Santa Cruz,CA,0.0,0.0,0.0,0,550.0,580.0,0.0,16277.0,0.3465,0.0196,0.3155,0.2035,0.0015,0.0017,0.074,0.0216,0.016,0.0278,1,0.4598,0.5458,0.0447,43000,19884
1426,Lexington Theological Seminary,Lexington,KY,0.0,0.0,0.0,1,,,0.0,,,,,,,,,,,,1,,,,,PrivacySuppressed


2. Get the dimensions of the DataFrame with the `.shape` attribute:

In [None]:
college.shape

(7535, 27)

3. List the data type of each column, the number of non-missing values, and memory usage with the `.info` method:

In [None]:
college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   INSTNM              7535 non-null   object 
 1   CITY                7535 non-null   object 
 2   STABBR              7535 non-null   object 
 3   HBCU                7164 non-null   float64
 4   MENONLY             7164 non-null   float64
 5   WOMENONLY           7164 non-null   float64
 6   RELAFFIL            7535 non-null   int64  
 7   SATVRMID            1185 non-null   float64
 8   SATMTMID            1196 non-null   float64
 9   DISTANCEONLY        7164 non-null   float64
 10  UGDS                6874 non-null   float64
 11  UGDS_WHITE          6874 non-null   float64
 12  UGDS_BLACK          6874 non-null   float64
 13  UGDS_HISP           6874 non-null   float64
 14  UGDS_ASIAN          6874 non-null   float64
 15  UGDS_AIAN           6874 non-null   float64
 16  UGDS_N

4. Get summary statistics for the numerical columns and transpose the DataFrame for
more readable output:

In [None]:
college.describe(include=[np.number]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0


5. Get summary statistics for the object (string) columns:

In [None]:
# np.object was removed in NumPy 1.24; the built-in object selects the same columns
college.describe(include=[object]).T

Unnamed: 0,count,unique,top,freq
INSTNM,7535,7535,Florida National University-South Campus,1
CITY,7535,2514,New York,87
STABBR,7535,59,CA,773
MD_EARN_WNE_P10,6413,598,PrivacySuppressed,822
GRAD_DEBT_MDN_SUPP,7503,2038,PrivacySuppressed,1510


It is possible to specify the exact quantiles returned by the `.describe` method when it is used with numeric columns:

In [None]:
college.describe(include=[np.number], percentiles=np.linspace(0, 1, 20, endpoint=False)).T

Unnamed: 0,count,mean,std,min,0%,5%,10%,15%,20%,25%,30%,35%,40%,45%,50%,55%,60%,65%,70%,75%,80%,85%,90%,95%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,290.0,430.0,447.4,460.0,470.0,475.0,485.0,493.0,499.0,505.0,510.0,520.0,530.0,540.0,548.0,555.0,570.0,585.0,605.0,665.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,310.0,430.0,453.0,465.0,475.0,482.0,490.0,495.0,503.0,510.0,520.0,525.0,535.0,545.0,555.0,565.0,580.0,600.0,630.0,685.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,0.0,31.65,49.0,68.0,92.0,117.0,148.9,193.0,246.0,321.0,412.5,547.0,736.6,1042.0,1425.3,1929.5,2734.8,4062.45,6512.3,11858.05,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.0,0.013265,0.06879,0.143665,0.20916,0.2675,0.33379,0.394565,0.45372,0.5074,0.5557,0.596945,0.63468,0.6733,0.711,0.747875,0.78508,0.8235,0.86297,0.927315,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.0,0.0,0.00753,0.0177,0.0267,0.036125,0.0457,0.056855,0.06854,0.0833,0.10005,0.1204,0.14638,0.1772,0.21311,0.2577,0.31654,0.40201,0.51571,0.726715,1.0
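
As a minimal sketch of the `percentiles` argument, the snippet below first shows what `np.linspace(0, 1, 20, endpoint=False)` produces and then passes a shorter custom list to `.describe`. The `values` Series is made-up data for illustration and is not part of the college dataset.

```python
import numpy as np
import pandas as pd

# 20 evenly spaced quantiles from 0 up to (but excluding) 1: 0.00, 0.05, ..., 0.95.
percentiles = np.linspace(0, 1, 20, endpoint=False)
print(percentiles)

# Hypothetical numeric column holding the integers 0 through 100.
values = pd.Series(range(101), name="value")

# Request only the 10th, 50th, and 90th percentiles.
summary = values.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

Each requested quantile shows up as a percentage label (for example `10%`) between `min` and `max` in the result.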


A crucial part of data analysis involves creating and maintaining a data dictionary. A data dictionary is a table of metadata and notes on each column of data. One of the primary purposes of a data dictionary is to explain the meaning of the column names. The college dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is inspecting it for the first time.

In [None]:
pd.read_csv("college_data_dictionary.csv")

Unnamed: 0,column_name,description
0,INSTNM,Institution Name
1,CITY,City Location
2,STABBR,State Abbreviation
3,HBCU,Historically Black College or University
4,MENONLY,0/1 Men Only
5,WOMENONLY,0/1 Women only
6,RELAFFIL,0/1 Religious Affiliation
7,SATVRMID,SAT Verbal Median
8,SATMTMID,SAT Math Median
9,DISTANCEONLY,Distance Education Only
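
A data dictionary can also be queried in code. The sketch below is only an illustration: it builds a small dictionary by hand (descriptions copied from the table above) and checks a hypothetical list of column names against it, so that undocumented columns stand out.

```python
import pandas as pd

# Small, hand-built data dictionary; descriptions copied from the table above.
data_dict = pd.DataFrame(
    {
        "column_name": ["INSTNM", "CITY", "STABBR", "HBCU"],
        "description": [
            "Institution Name",
            "City Location",
            "State Abbreviation",
            "Historically Black College or University",
        ],
    }
)

# Map column name -> description for quick lookups.
lookup = data_dict.set_index("column_name")["description"]

# Hypothetical list of columns from a dataset under inspection.
columns = ["INSTNM", "STABBR", "UGDS"]

# reindex returns a missing value for columns absent from the dictionary,
# which flags entries that still need documentation.
described = lookup.reindex(columns)
undocumented = described[described.isna()].index.tolist()

print(described)
print(undocumented)
```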


## Reducing memory by changing data types

Change the data type of one of the object columns from the college dataset to the
special pandas categorical data type to drastically reduce its memory usage.

In [None]:
college.columns

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

In [None]:
college.select_dtypes(include=['object']).columns

Index(['INSTNM', 'CITY', 'STABBR', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'], dtype='object')

In [None]:
college.select_dtypes(include=[np.number]).columns

Index(['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'SATVRMID', 'SATMTMID',
       'DISTANCEONLY', 'UGDS', 'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP',
       'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA',
       'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV'],
      dtype='object')

In [None]:
different_cols = ["RELAFFIL", "SATMTMID", "CURROPER", "INSTNM", "STABBR",]

col2 = college.loc[:, different_cols]
col2.head()

Unnamed: 0,RELAFFIL,SATMTMID,CURROPER,INSTNM,STABBR
0,0,420.0,1,Alabama A & M University,AL
1,0,565.0,1,University of Alabama at Birmingham,AL
2,1,,1,Amridge University,AL
3,0,590.0,1,University of Alabama in Huntsville,AL
4,0,430.0,1,Alabama State University,AL


Inspect the data types of each column:

In [None]:
col2.dtypes

RELAFFIL      int64
SATMTMID    float64
CURROPER      int64
INSTNM       object
STABBR       object
dtype: object

Find the memory usage of each column with the `.memory_usage` method:

In [None]:
original_memory = col2.memory_usage(deep=True)
original_memory

Index          128
RELAFFIL     60280
SATMTMID     60280
CURROPER     60280
INSTNM      660240
STABBR      444565
dtype: int64

In [None]:
col2.RELAFFIL.nunique()

2

In [None]:
col2.RELAFFIL.unique()

array([0, 1])

In [None]:
col2.CURROPER.nunique()

2

In [None]:
col2.CURROPER.unique()

array([1, 0])

There is no need to use 64 bits for the RELAFFIL and CURROPER columns because they contain only 0 or 1. An 8-bit integer is sufficient:

In [None]:
col2.RELAFFIL = col2.RELAFFIL.astype(np.int8)
col2.CURROPER = col2.CURROPER.astype(np.int8)
col2.dtypes

RELAFFIL       int8
SATMTMID    float64
CURROPER       int8
INSTNM       object
STABBR       object
dtype: object

Find the memory usage of each column again and note the large reduction:

In [None]:
col2.memory_usage(deep=True)

Index          128
RELAFFIL      7535
SATMTMID     60280
CURROPER      7535
INSTNM      660240
STABBR      444565
dtype: int64
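
Instead of choosing `np.int8` by hand, pandas can pick the smallest suitable integer type. The following is a minimal sketch using `pd.to_numeric` with `downcast="integer"` on a made-up flag column that stands in for RELAFFIL or CURROPER.

```python
import numpy as np
import pandas as pd

# Toy 0/1 flag column stored as int64 (hypothetical stand-in for RELAFFIL).
flags = pd.Series([0, 1, 0, 0, 1], dtype=np.int64)

# downcast="integer" picks the smallest signed integer dtype that holds the values.
small = pd.to_numeric(flags, downcast="integer")

# Compare dtype and bytes used by the values (index excluded).
print(flags.dtype, flags.memory_usage(index=False))
print(small.dtype, small.memory_usage(index=False))
```

With values limited to 0 and 1, the downcast column uses one byte per row instead of eight, the same eightfold reduction seen above.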

To save even more memory, consider changing object data types to
categorical if they have a reasonably low cardinality (number of unique values). Let's
first check the number of unique values for both of the object columns:

In [None]:
col2.select_dtypes(include=['object']).nunique()

INSTNM    7535
STABBR      59
dtype: int64

In [None]:
col2.select_dtypes(include=['object']).nunique() / col2.shape[0]

INSTNM    1.00000
STABBR    0.00783
dtype: float64

The STABBR column is a good candidate to convert to categorical, as less than one percent of its values are unique:

In [None]:
col2.STABBR = col2.STABBR.astype('category')
col2.dtypes

RELAFFIL        int8
SATMTMID     float64
CURROPER        int8
INSTNM        object
STABBR      category
dtype: object

In [None]:
new_memory = col2.memory_usage(deep=True)
new_memory

Index          128
RELAFFIL      7535
SATMTMID     60280
CURROPER      7535
INSTNM      660699
STABBR       13576
dtype: int64

Let's compare the original memory usage with our updated memory usage:


In [None]:
new_memory / original_memory

Index       1.000000
RELAFFIL    0.125000
SATMTMID    1.000000
CURROPER    0.125000
INSTNM      1.000695
STABBR      0.030538
dtype: float64

The RELAFFIL column is, as expected, an eighth of its original size, while the STABBR column has shrunk to just three percent of its original size.
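
To see why a categorical column is so much smaller, it helps to look at how it is stored. The sketch below uses synthetic state abbreviations (not the college dataset) and inspects the `.cat.categories` and `.cat.codes` accessors; the exact memory ratio depends on the pandas version and string storage, so none is quoted here.

```python
import pandas as pd

# Synthetic low-cardinality string column (hypothetical data).
states = pd.Series(["CA", "TX", "NY", "CA", "TX"] * 2000)

as_category = states.astype("category")

# A categorical stores each distinct label once plus a small integer code per row.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)

# Ratio of memory used after conversion to memory used before.
ratio = as_category.memory_usage(deep=True) / states.memory_usage(deep=True)
print(f"categorical uses {ratio:.1%} of the original column's memory")
```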

## Selecting the smallest of the largest
This recipe can be used to create catchy news headlines such as "Out of the Top 100 Universities, These 5 Have the Lowest Tuition" or "From the Top 50 Cities to Live, These 10 Are the Most Affordable".

During analysis, you may first need to find a group of rows that contains the top n values in a single column and then, from this subset, find the bottom m values based on a different column.
In this recipe, we find the five lowest-budget movies among the 100 top-scoring movies by taking advantage of the convenience methods `.nlargest` and `.nsmallest`.

Read in the movie dataset, and select the columns `movie_title`, `imdb_score`, and `budget`:

In [2]:
movie = pd.read_csv("./movie.csv")
movie2 = movie[['movie_title', 'imdb_score', 'budget']]
movie2.head()

Unnamed: 0,movie_title,imdb_score,budget
0,Avatar,7.9,237000000.0
1,Pirates of the Caribbean: At World's End,7.1,300000000.0
2,Spectre,6.8,245000000.0
3,The Dark Knight Rises,8.5,250000000.0
4,Star Wars: Episode VII - The Force Awakens,7.1,


Use the `.nlargest` method to select the top 100 movies by imdb_score:

In [None]:
movie2.nlargest(n=100, columns=['imdb_score'])

Unnamed: 0,movie_title,imdb_score,budget
2725,Towering Inferno,9.5,
1920,The Shawshank Redemption,9.3,25000000.0
3402,The Godfather,9.2,6000000.0
2779,Dekalog,9.1,
4312,Kickboxer: Vengeance,9.1,17000000.0
...,...,...,...
4023,Oldboy,8.4,3000000.0
4163,To Kill a Mockingbird,8.4,2000000.0
4395,Reservoir Dogs,8.4,1200000.0
4550,A Separation,8.4,500000.0


Chain the `.nsmallest` method to return the five lowest-budget films among those
with a top 100 score:

In [None]:
(
    movie2.nlargest(n=100, columns=['imdb_score'])
    .nsmallest(n=5, columns=['budget'])
)

Unnamed: 0,movie_title,imdb_score,budget
4804,Butterfly Girl,8.7,180000.0
4801,Children of Heaven,8.5,180000.0
4706,12 Angry Men,8.9,350000.0
4550,A Separation,8.4,500000.0
4636,The Other Dream Team,8.4,500000.0
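
The same chain can be checked by hand on a small example. The sketch below uses a made-up DataFrame (titles, scores, and budgets are hypothetical) and selects the two cheapest films among the three highest-scoring ones.

```python
import pandas as pd

# Hypothetical toy data; titles, scores, and budgets are made up.
films = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E", "F"],
        "score": [9.0, 8.5, 8.0, 7.5, 7.0, 6.5],
        "budget": [50.0, 5.0, 20.0, 1.0, 2.0, 3.0],
    }
)

# Top 3 by score (A, B, C), then the 2 cheapest among those (B, C).
cheapest_of_best = (
    films.nlargest(n=3, columns="score")
    .nsmallest(n=2, columns="budget")
)
print(cheapest_of_best)
```

Films D, E, and F have the lowest budgets overall, but they never reach the second step because they are not among the top three scores.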


## Selecting the largest of each group by sorting

One of the most basic and common operations during data analysis is selecting the rows that contain the largest value of some column within a group, for instance, the highest-rated film of each year or the highest-grossing film by content rating.
To accomplish this task, we sort by the grouping column as well as the column used to rank each member of the group, and then extract the highest member of each group.

Find the highest-rated film of each year.

In [None]:
movie.head(2)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0


In [None]:
df = movie[['movie_title', 'title_year', 'imdb_score']]
df.head()

Unnamed: 0,movie_title,title_year,imdb_score
0,Avatar,2009.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,7.1
2,Spectre,2015.0,6.8
3,The Dark Knight Rises,2012.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,7.1


In [None]:
df.sort_values(by='title_year', ascending=True)

Unnamed: 0,movie_title,title_year,imdb_score
4695,Intolerance: Love's Struggle Throughout the Ages,1916.0,8.0
4833,Over the Hill to the Poorhouse,1920.0,4.8
4767,The Big Parade,1925.0,8.3
2694,Metropolis,1927.0,8.3
4697,The Broadway Melody,1929.0,6.3
...,...,...,...
4683,Heroes,,7.7
4688,Home Movies,,8.2
4704,Revolution,,6.7
4752,Happy Valley,,8.5


In [None]:
# Sorting by year and score
df.sort_values(by=['title_year', 'imdb_score'], ascending=False)

Unnamed: 0,movie_title,title_year,imdb_score
4312,Kickboxer: Vengeance,2016.0,9.1
4277,A Beginner's Guide to Snuff,2016.0,8.7
3798,Airlift,2016.0,8.5
27,Captain America: Civil War,2016.0,8.2
98,Godzilla Resurgence,2016.0,8.2
...,...,...,...
1391,Rush Hour,,5.8
4031,Creature,,5.0
2165,Meet the Browns,,3.5
3246,The Bold and the Beautiful,,3.5


By default, the `.drop_duplicates` method keeps the first row for each distinct value, so after sorting it returns the highest-rated film of every year:

In [None]:
(
    df.sort_values(by=['title_year', 'imdb_score'], ascending=False)
    .drop_duplicates(subset=['title_year'])
)

Unnamed: 0,movie_title,title_year,imdb_score
4312,Kickboxer: Vengeance,2016.0,9.1
3745,Running Forever,2015.0,8.6
4369,Queen of the Mountains,2014.0,8.7
3935,"Batman: The Dark Knight Returns, Part 2",2013.0,8.4
3,The Dark Knight Rises,2012.0,8.5
...,...,...,...
2694,Metropolis,1927.0,8.3
4767,The Big Parade,1925.0,8.3
4833,Over the Hill to the Poorhouse,1920.0,4.8
4695,Intolerance: Love's Struggle Throughout the Ages,1916.0,8.0
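
The sort-then-deduplicate pattern can be verified on a small example. The sketch below uses hypothetical data and also shows one alternative, `groupby` with `.idxmax`, which is not part of the recipe above but selects the same rows when there are no ties or missing values.

```python
import pandas as pd

# Hypothetical toy data.
films = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "year": [2001, 2001, 2002, 2002, 2002],
        "score": [7.0, 8.0, 6.5, 9.0, 5.0],
    }
)

# Sort so the best film of each year comes first, then keep the first row per year.
by_sorting = (
    films.sort_values(by=["year", "score"], ascending=False)
    .drop_duplicates(subset="year")
)

# Alternative: look up the row label of the maximum score within each year.
by_idxmax = films.loc[films.groupby("year")["score"].idxmax()]

print(by_sorting)
print(by_idxmax)
```

Both approaches pick film B for 2001 and film D for 2002; only the row order differs.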


## Replicating nlargest with sort_values

Replicate the "Selecting the smallest of the largest" recipe with the
`.sort_values` method and explore the differences between the two approaches.

In [3]:
(
    movie[['movie_title', 'imdb_score', 'budget']]
    .nlargest(n=100, columns=['imdb_score'])
    .nsmallest(n=5, columns=['budget'])
)

Unnamed: 0,movie_title,imdb_score,budget
4804,Butterfly Girl,8.7,180000.0
4801,Children of Heaven,8.5,180000.0
4706,12 Angry Men,8.9,350000.0
4550,A Separation,8.4,500000.0
4636,The Other Dream Team,8.4,500000.0


Use `.sort_values` to replicate the first part of the expression and grab the first 100
rows with the `.head` method:

In [5]:
(
    movie[['movie_title', 'imdb_score', 'budget']]
    .sort_values(by=['imdb_score'], ascending=False)
    .head(100)
)

Unnamed: 0,movie_title,imdb_score,budget
2725,Towering Inferno,9.5,
1920,The Shawshank Redemption,9.3,25000000.0
3402,The Godfather,9.2,6000000.0
2779,Dekalog,9.1,
4312,Kickboxer: Vengeance,9.1,17000000.0
...,...,...,...
3799,Anne of Green Gables,8.4,
3777,Requiem for a Dream,8.4,4500000.0
3935,"Batman: The Dark Knight Returns, Part 2",8.4,3500000.0
4636,The Other Dream Team,8.4,500000.0


Now that we have the top 100 scoring movies, we can use `.sort_values` with `.head` again to grab the lowest five by budget:

In [6]:
(
    movie[['movie_title', 'imdb_score', 'budget']]
    .sort_values(by=['imdb_score'], ascending=False)
    .head(100)
    .sort_values(by=['budget'], ascending=True)
    .head(5)
)

Unnamed: 0,movie_title,imdb_score,budget
4815,A Charlie Brown Christmas,8.4,150000.0
4801,Children of Heaven,8.5,180000.0
4804,Butterfly Girl,8.7,180000.0
4706,12 Angry Men,8.9,350000.0
4636,The Other Dream Team,8.4,500000.0


Are the two results the same? No. To understand why they are
not equivalent, let's look at the tail of the intermediate step of each recipe:

In [7]:
(
    movie[['movie_title', 'imdb_score', 'budget']]
    .nlargest(n=100, columns=['imdb_score'])
    .tail()
)

Unnamed: 0,movie_title,imdb_score,budget
4023,Oldboy,8.4,3000000.0
4163,To Kill a Mockingbird,8.4,2000000.0
4395,Reservoir Dogs,8.4,1200000.0
4550,A Separation,8.4,500000.0
4636,The Other Dream Team,8.4,500000.0


In [9]:
(
    movie[['movie_title', 'imdb_score', 'budget']]
    .sort_values(by=['imdb_score'], ascending=False)
    .head(100)
    .tail()
)

Unnamed: 0,movie_title,imdb_score,budget
3799,Anne of Green Gables,8.4,
3777,Requiem for a Dream,8.4,4500000.0
3935,"Batman: The Dark Knight Returns, Part 2",8.4,3500000.0
4636,The Other Dream Team,8.4,500000.0
2455,Aliens,8.4,18500000.0


The issue arises because more than 100 movies have a rating of at least 8.4. The `.nlargest` and `.sort_values` methods break ties differently, which results in a slightly different 100-row DataFrame. If you pass `kind='mergesort'` to the `.sort_values` method, you will get the same result as `.nlargest`, because mergesort is a stable sort that keeps tied rows in their original order:

In [10]:
(
    movie[['movie_title', 'imdb_score', 'budget']]
    .sort_values(by=['imdb_score'], ascending=False, kind='mergesort')
    .head(100)
    .tail()
)

Unnamed: 0,movie_title,imdb_score,budget
4023,Oldboy,8.4,3000000.0
4163,To Kill a Mockingbird,8.4,2000000.0
4395,Reservoir Dogs,8.4,1200000.0
4550,A Separation,8.4,500000.0
4636,The Other Dream Team,8.4,500000.0
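
Tie-breaking is easier to see on a tiny example. The sketch below uses made-up scores with a three-way tie and compares the default `keep='first'` behavior of `.nlargest`, its `keep='all'` option, and a stable `.sort_values`.

```python
import pandas as pd

# Hypothetical scores; B, C, and D are tied at 8.4.
scores = pd.DataFrame(
    {"title": ["A", "B", "C", "D"], "score": [9.0, 8.4, 8.4, 8.4]}
)

# keep="first" (the default) takes tied rows in their original order.
top2_first = scores.nlargest(n=2, columns="score")

# keep="all" returns every row tied with the last selected value,
# so the result can have more than n rows.
top2_all = scores.nlargest(n=2, columns="score", keep="all")

# A stable sort also preserves the original order of tied rows.
top2_stable = (
    scores.sort_values(by="score", ascending=False, kind="mergesort")
    .head(2)
)

print(top2_first)
print(top2_all)
print(top2_stable)
```

When ties at the cutoff matter for the analysis, `keep='all'` avoids silently dropping some of the tied rows.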


## Calculating a trailing stop order price

In [14]:
!pip install requests-cache --quiet

In [15]:
import datetime
import pandas_datareader.data as web
import requests_cache

To get started, we will work with Tesla Motors (TSLA) stock and presume a purchase on the first trading day of September 2021:

In [20]:
session = requests_cache.CachedSession(
   cache_name='cache', backend='sqlite', 
   expire_after=datetime.timedelta(days=90))

# just add headers to your session and provide it to the reader
session.headers = {     
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',   
    'Accept': 'application/json;charset=utf-8'     }

tsla = web.DataReader('tsla', data_source='yahoo',
   start='2021-9-1', session=session)
tsla.head(8)



Date,High,Low,Open,Close,Volume,Adj Close
2021-09-01,741.98999,731.27002,734.080017,734.090027,13204300,734.090027
2021-09-02,740.969971,730.539978,734.5,732.390015,12777300,732.390015
2021-09-03,734.0,724.200012,732.25,733.570007,15246100,733.570007
2021-09-07,760.200012,739.26001,740.0,752.919983,20039800,752.919983
2021-09-08,764.450012,740.77002,761.580017,753.869995,18793000,753.869995
2021-09-09,762.099976,751.630005,753.409973,754.859985,14077700,754.859985
2021-09-10,762.609985,734.52002,759.599976,736.27002,15114300,736.27002
2021-09-13,744.780029,708.849976,740.210022,743.0,22952500,743.0


For simplicity, we will work with the closing price of each trading day:

In [21]:
tsla_close = tsla["Close"]

Use the `.cummax` method to track the highest closing price up to the current date:

In [22]:
tsla_cummax = tsla_close.cummax()
tsla_cummax.head()

Date
2021-09-01    734.090027
2021-09-02    734.090027
2021-09-03    734.090027
2021-09-07    752.919983
2021-09-08    753.869995
Name: Close, dtype: float64

To limit the downside to 10%, we multiply the result by 0.9. This gives the trailing stop order price for each day. We can chain all of the steps together:

In [23]:
(
    tsla_close.cummax()
    .mul(.9)
    .head()
)

Date
2021-09-01    660.681024
2021-09-02    660.681024
2021-09-03    660.681024
2021-09-07    677.627985
2021-09-08    678.482996
Name: Close, dtype: float64
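
Because the download above depends on a remote data source, the same calculation can be sketched offline. The example below uses made-up closing prices (not TSLA data) and adds one illustrative step that is not part of the recipe: finding the days on which the close falls to or below the stop price.

```python
import pandas as pd

# Hypothetical closing prices (made-up values, not TSLA data).
close = pd.Series(
    [100.0, 110.0, 105.0, 120.0, 107.0],
    index=pd.date_range("2021-09-01", periods=5, freq="D"),
    name="Close",
)

# Trailing stop: 90% of the highest close seen so far.
stop = close.cummax().mul(0.9)

# Days on which the close is at or below the stop price would trigger the order.
triggered = close[close <= stop]

print(stop)
print(triggered)
```

Here the running maximum reaches 120, so the stop sits at 108, and the final close of 107 is the only one that would trigger the order.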