# Data Cleaning

## Research Questions

- How does choice of graduate school major and professional field reflect trends in Netflix movie streaming, based on ratings of different genres?
- Is there a strong enough association to suggest a potential influence of academia portrayed on media while students choose their field of study/career?
- Does the number of newly released movies pertaining to specific fields affect people's choices of majors in graduate school? Similarly, does the average rating of those movies have any effect on major choice?

#### Resources
- https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
- https://stackoverflow.com/questions/57958432/how-to-add-table-title-in-python-preferably-with-pandas
- https://www.geeksforgeeks.org/concatenate-strings-from-several-rows-using-pandas-groupby/ 
- https://stackoverflow.com/questions/61688906/column-in-pandas-series-is-appearing-in-a-row-above-the-rest
- https://www.geeksforgeeks.org/convert-list-like-column-elements-to-separate-rows-in-pandas/
- https://www.geeksforgeeks.org/plot-a-horizontal-line-in-matplotlib/
- https://www.statology.org/cannot-mask-with-non-boolean-array-containing-na-nan-values/
- https://www.tutorialspoint.com/how-to-avoid-overlapping-of-labels-and-autopct-in-a-matplotlib-pie---chart#:~:text=Use%20pie()%20method%20to,figure%2C%20use%20show()%20method.
- https://datascienceparichay.com/article/matplotlib-label-points-on-scatter-plot/
- INFO2950 FA23 HW4: Normalizer() function definition

### Importing python libraries:

In [1]:
import pandas as pd
import duckdb as db
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

### Importing dataframes:

Our group analyzed three different datasets: FiveThirtyEight College Majors Dataset, Netflix Data Analysis Dataset, and Lastest Movie Lens dataset.

### Links to datasets:
- 538 college majors: https://www.kaggle.com/datasets/fivethirtyeight/fivethirtyeight-college-majors-dataset
- Netflix Data Analysis: https://www.kaggle.com/code/ameyagrawal/netflix-data-analysis/input?select=titles.csv
- Movie Lens: https://grouplens.org/datasets/movielens/latest/

In [5]:
movies = pd.read_csv('ml-latest/movies.csv', header = 0)
genome_scores = pd.read_csv('ml-latest/genome-scores.csv', header = 0)
genome_tags = pd.read_csv('ml-latest/genome-tags.csv', header = 0)
ratings = pd.read_csv('ml-latest/ratings.csv', header = 0)
# reading relevant Latest Movie Lens csvs

In [6]:
grad_583 = pd.read_csv('58collegedatasets/grad-students.csv', header = 0)
recentgrads_583 = pd.read_csv('58collegedatasets/recent-grads.csv', header = 0)
# reading relevant 583 college major csvs

In [7]:
titles_netflix = pd.read_csv('kagglenetflix/titles.csv')
# reading the title csv from the Netflix Data Analysis csvs

# Data Cleaning and Explanations

## Netflix Data Analysis df cleaning

In order to extract Netflix movies from all of the Latest Movie Lens df, we need to make a Series that has all the titles of movies on Netflix.

In [8]:
nflix_titles = titles_netflix.filter(items=['title']).fillna('')
# extracts only the movie titles from the df 
nflix_titles.head()
# view first five rows of netflix_titles df

Unnamed: 0,title
0,Five Came Back: The Reference Films
1,Taxi Driver
2,Deliverance
3,Monty Python and the Holy Grail
4,The Dirty Dozen


## Movie Lens Data Cleaning

### Cleaning movies df

In the movies df, if a movie has a release date they are represented as a string in the title col. Therefore, we need to seperate the title cols into a string title col without the release year and a datetime release_year col with the release years of the respective movie.

In [9]:
movies["release_year"] = movies["title"].str[-5:-1]
# Creating a release_year column from year strings in the title column

movies["title"] = movies["title"].str[:-7]
# creating a title column without the year strings

movies["release_year"] = pd.to_numeric(movies["release_year"], 
                                       errors = "coerce")
# convert years to datetime values and convert any movies 
# without release_year data available to NaN values

movies["release_year"] = movies["release_year"].fillna(0)
# convert NaN values to zero, so movies without 
# release_year are removed in next step

Now that the release years are represented as int values in the release_year col, we can filter movies by release year. The data in 583 College Major dfs was collected between 2010 and 2012. The relevant movies will be released between 2000 and 2023.  Then we will convert the release_years left after filtering into datetime dtype.

In [10]:
selected_movies = movies[movies.release_year >= 2000]
selected_movies = selected_movies[movies.release_year <= 2023]
# Restrict the amount of movies to those released between 2000 and 2030

datetime = pd.to_datetime(selected_movies.release_year, 
                          errors = "coerce", 
                          format = '%Y').fillna(0)
# create a series of release_year with its values converted to datetime

selected_movies["release_year"] = datetime.dt.year
# replace release_year int values with their datetime counterparts

  selected_movies = selected_movies[movies.release_year <= 2023]


Now that movies are filtered into recently released movies and release years aer converted to datetime dtype, we need to check if movies from the Latest Movie Lens df are on Netflix. We do this by using the .isin() function and keep the movies with title that are in the nfix_titles series.

In [11]:
seleceted_movies = selected_movies[selected_movies.title.isin(nflix_titles) == True]
# selecting movies that are on Netflix
movie_id2000 = selected_movies['movieId']
# make a series that contains the movieIds 
# of movies released during and after 2000
selected_movies.head()
# view first five rows of selected_movies

Unnamed: 0,movieId,title,genres,release_year
2677,2769,"Yards, The",Crime|Drama,2000
3084,3177,Next Friday,Comedy,2000
3097,3190,Supernova,Adventure|Sci-Fi|Thriller,2000
3132,3225,Down to You,Comedy|Romance,2000
3135,3228,Wirey Spindell,Comedy,2000


We are finally left with a df of movies that are on Netflix and were released between 2000 and 2023.

### Cleaning ratings df

The first step we took to clean the ratings df from Latest Movie Lens was to remove irrelevant data: timestamp, userId, and movies that are (a) released before 2000 and after 2023 and (b) not on Netflix.

In [12]:
ratings_drop = ratings.drop(columns = ["timestamp", "userId"])
# removing the "timestamp" column because length 
# of movie is not relevant to our research question

# removing the "userId" column because user demographics 
# are not provided, which makes users irrelevant


ratings_restricted = ratings_drop[ratings_drop.movieId.isin(movie_id2000) == True]
# restrict the movieId values to those belonging to 
# movies created between 2000 and 2023 and on Netflix

Now that irrelevant data has been removed, we are going to group all user ratings together by movie and find the average movie rating for each movie. This data is then placed into an  avg_rating col and the rating col is dropped from the df. Duplicate rows are also dropped.

In [13]:
avg_rating = ratings_restricted.groupby("movieId", as_index = False)["rating"].mean()
# find average rating of movie with movieId
rating_dropdup = avg_rating.drop_duplicates()
# remove duplicate rows
rating_final = rating_dropdup.rename(columns = {"rating":"avg_rating"})
# rename rating col to avg_rating
rating_final.head()
# view first five rows of the df

Unnamed: 0,movieId,avg_rating
0,2769,3.133129
1,3177,2.865199
2,3190,2.348422
3,3225,2.679353
4,3228,2.125


We are now left with a df that has a row for each movie(between 2000 and 2023, and on Netflix) and its respective average rating.

### Cleaning and combining genome dfs

There are over 300 tags in the tags df in Latest Movie Lens. We sorted throught the tags and created a series containing the selected tags. Selected tags were evaluated in relation to genre, negative/positive connotative words, and related to specific job industries (i.e. "police", "science, "artist, etc.).

In [14]:
rel_tags = ["adventure", "alien", "alien invasion", 
            "aliens", "allegory", "android(s)/cyborg(s)",
            "androids",'animated', 'animation', 'anime',
            'art', 'art house', 'artificial intelligence', 
            'artist', 'artistic', 'artsy', 'astronauts', 'blood', 
            'bloody', 'boarding school', 'boring', 
            'boring!', 'brilliant', 'bullshit history', 
            'bullying', 'business', 'cancer', 'cartoon', 'cia',
            'cinematography', 'claymation', 'clever', 
            'clones', 'college', 'colourful', 'comic', 
            'comic book', 'comic book adaption', 'comics',
            'coming of age', 'coming-of-age', 
            'computer animation', 'computers', 
            'con artists', 'confusing', 'crime', 
            'crime gone awry',
            'cyberpunk', 'cyborgs', 'death', 
            'death penalty', 'detective', 'dinosaurs', 
            'disappointing', 'doctors', 'dumb', 'dumb but funny',
            'dynamic cgi action', 'ecology', 
            'educational', 'entertaining', 'environment', 
            'environmental', 'epic', 'evolution', 'factual', 
            'fairy tale', 'fairy tales', 'fake documentary', 
            'fbi', 'fun', 'fun movie', 'funniest movies', 
            'funny', 'funny as hell', 'future',
            'futuristic', 'geek', 'geeks', 'genetics', 'genius', 
            'global warming', 'good story', 
            'good versus evil', 'gore', 'goretastic', 'gory',
            'graphic design', 'graphic novel', 
            'gratuitous violence', 'gritty', 'gross-out', 
            'gruesome', 'hackers', 'hacking', 'hard to watch',
            'heist', 'horrible', 'hospital', 'hostage', 
            'idiotic', 'imdb top 250', 'insanity', 'intellectual', 'intelligent',
            'intelligent sci-fi','lame', 'lawyer', 
            'lawyers', 'mad scientist', 'man versus machine', 
            'manipulation', 'masterpiece', 'math',
            'mathematics', 'mental hospital', 'mental illness', 
            'metaphysics', 'military', 'mindfuck', 'mining', 
            'murder', 'murder mystery',
            'music', 'music business', 'musical', 'musicians', 
            'mutants', 'mystery', 'mythology', 'narrated', 
            'nasa', 'nerds', 'nuclear', 'nuclear bomb',
            'nuclear war', 'organized crime', 'oscar', 
            'oscar (best animated feature)', 
            'oscar (best effects - visual effects', 'oscar (best picture)', 
            'oscar winner', 'overrated', 'perfect', 'pointless', 
            'police', 'police corruption', 'police investigation', 
            'political',
            'political corruption', 'politics', 'prison', 
            'prison escape', 'private detective', 'realistic', 
            'robot', 'robots', 'saturn award (best science fiction film)', 
            'scary', 'school', 'sci fi', 'sci-fi', 'science', 
            'science fiction', 'scifi', 
            'scifi cult', 'secret service', 'serial killer', 'simple', 
            'space', 'spies', 'spy', 'spying', 'stop motion', 'stop-motion',
            'studio ghibli', 'stupid', 'stupid as hell', 'stupidity', 
            'suprisingly clever', 'teacher', 'tear jerker', 'technology', 'teen',
            'teen movie', 'teenager', 'teenagers', 'teens', 'teleportation', 
            'television', 'terminal illness', 'terrorism', 'too long', 
            'too short',
            'true story', 'undercover cop', 'united nations', 'unrealistic', 
            'very interesting', 'video game','violence', 'violent', 
            'visceral', 'war', 'war movie', 'wartime', 'waste of time', 
            'weapons', 'working class', "workplace" ]

print(len(rel_tags))
tags_maj = ["adventure", "alien", "alien invasion", "aliens", 
            "allegory", "android(s)/cyborg(s)","androids",
            'animated', 'animation', 'anime', 'art', 
            'art house', 'artificial intelligence', 'artist', 
            'artistic', 'artsy', 'astronauts', 'blood', 
            'bloody', 'boarding school', 'brilliant', 'bullshit history', 
            'bullying', 'business', 'cancer', 'cartoon', 'cia',
            'cinematography', 'claymation', 'clones', 'college', 
            'comic', 'comic book', 'comic book adaption', 'comics',
            'computer animation', 'computers', 'con artists', 'crime', 
            'crime gone awry',
            'cyberpunk', 'cyborgs', 'death', 'death penalty', 
            'detective', 'dinosaurs', 'doctors', 
            'dynamic cgi action', 'ecology', 'educational', 
            'entertaining', 'environment', 'environmental',
            'epic', 'evolution', 'factual', 
            'fairy tale', 'fairy tales', 'fake documentary', 
            'fbi', 'funniest movies', 'funny', 
            'funny as hell', 'future',
             'genetics', 'genius', 'global warming', 
            'good versus evil', 'gore', 'goretastic', 
            'gory', 
            'graphic design', 
            'graphic novel', 'gross-out', 
            'gruesome', 'hackers', 'hacking', 
            'heist', 'hospital', 'hostage', 'insanity',
            'intelligent sci-fi','lame', 'lawyer',
            'lawyers', 'mad scientist', 'man versus machine',  'math',
            'mathematics', 'mental hospital', 'mental illness',
            'metaphysics', 'military', 'mining', 'murder', 'murder mystery',
            'music', 'music business', 'musical', 'musicians', 
            'mutants', 'mystery', 'mythology', 'narrated',
            'nasa', 'nuclear', 'nuclear bomb',
            'nuclear war', 'organized crime', 
            'police', 'police corruption', 'police investigation', 'political',
            'political corruption', 'politics', 'prison',
            'prison escape', 'private detective',
            'robot', 'robots', 'school', 'sci fi', 'sci-fi',
            'science', 'science fiction', 'scifi', 
            'scifi cult', 'secret service', 'serial killer', 
            'space', 'spies', 'spy', 'spying', 'stop motion', 'stop-motion',
            'studio ghibli', 'teacher',  'technology', 
            'teleportation', 'television', 'terminal illness', 'terrorism',
            'true story', 'undercover cop', 'united nations',  'video game',
             'war', 'war movie', 'wartime', 'weapons', 'working class', "workplace" ]
print(len(tags_maj))

210
152


We now need to join genome_scores and genome_tags on their common col of tag_id. We then remove duplicated or irrelevant data (not released between 2000 and 2023, not on Netlix, tags not in rel_tags series, and the duplicated tagId col created during the SQL JOIN).

In [15]:
genome_combined = db.sql("SELECT * FROM genome_scores FULL JOIN genome_tags \
ON genome_scores.tagId = genome_tags.tagId").df()
# combining genome_scores and genome_tag dfs on their common tagId column
genome_final = genome_combined.drop(columns = "tagId_2")
# remove duplicate tagId column
genome_final = genome_final[genome_final.movieId.isin(movie_id2000) == True]
# remove rows that have movies that came out before 2000 and after 2023
genome_final = genome_final[genome_final.tag.isin(rel_tags) == True]
# remove rows with irrelevant tags

Here we must evaluate the relevance of tags. We decided that a tag for a movie is considered relevant if its relevance col value was greater or equal to .84. Any tag rows that did not have a relevannce greater than or equal to .84 was removed from the df. The tagId and relevance cols are no longer relevant because their values have been accounted for, so they are removed from the df.

In [16]:
genome_tag_relevant = genome_final[genome_final.relevance >= .84]
# If relevance of a tag for a movie is greater than .84 
# it is placed into the genome_tag_relevant df
genome_tag_relevant = genome_tag_relevant.drop(columns = ["tagId", "relevance"])
# remove tagId and relevance cols from genome_tag_relevant

To make the data tidier, we feel that is is necessary to combine all individual tag values into one row based on their respective movie. Duplicate rows are then removed.

In [17]:
genome_tag_relevant["tags"] = genome_tag_relevant.groupby(['movieId'])['tag'].transform(lambda x : ', '.join(x))  
# combine relevant tags into one row
relevant_tags_drop = genome_tag_relevant.drop(columns = "tag")
relevant_tags_rename = relevant_tags_drop.rename(columns = {"tags":"relevant_tags"})
# create a new df that is a copy of genome_tag_relevant without the tag col
relevant_tags = relevant_tags_rename.drop_duplicates()
# remove duplicate rows after tags were combined into one row
print(relevant_tags.head())
# view first five rows of relevant df

        movieId                                      relevant_tags
786041     3177                funny, prison escape, funny as hell
801140     3190  alien, future, futuristic, sci fi, sci-fi, sci...
821370     3326                                      alien, aliens
844564     3354  alien, aliens, astronauts, nasa, sci fi, sci-f...
895099     3408                 lawyer, lawyers, oscar, true story


Finally, we are left with a relevant_tags df that has movies with movieID, relevant tags, and filtered for year and availability on Netflix.

## Combining all ml_datasets

Now that we have curated and cleaned all relevant Latest Movie Lens dfs, we need to combine them into one df to make it easier to analyze their relationship to majors of similar genre/relevant tag.

In [18]:
genome_movies = db.sql('SELECT * FROM selected_movies FULL\
                       JOIN relevant_tags ON \
                       selected_movies.movieId = relevant_tags.movieId').df()
genome_movies = genome_movies.drop(columns = "movieId_2")
# combining genome_tag_final and selected_movies on their common movieId column

In [19]:
genome_movies_ratings = db.sql('SELECT * FROM rating_final JOIN \
genome_movies ON genome_movies.movieId = rating_final.movieId').df()
# combining rating_final and genome_movies on their common movieId col

In [20]:
ml_drops = genome_movies_ratings.drop(columns = ["movieId", "movieId_2"])
# remove movieId and movieId_2 columns, because they are no longer relevant,
# as they were only useful in combining dfs
ml_final = ml_drops
ml_final["relevan_tags"] = ml_final["relevant_tags"].fillna("None")
ml_final.head()

# In the future we would like to create seperate columns for each genre that a movie is in.

Unnamed: 0,avg_rating,title,genres,release_year,relevant_tags,relevan_tags
0,3.133129,"Yards, The",Crime|Drama,2000,"crime, political corruption","crime, political corruption"
1,2.865199,Next Friday,Comedy,2000,"funny, prison escape, funny as hell","funny, prison escape, funny as hell"
2,2.679353,Down to You,Comedy|Romance,2000,"teen, teen movie","teen, teen movie"
3,2.289954,Isn't She Great?,Comedy,2000,cancer,cancer
4,2.449003,Scream 3,Comedy|Horror|Mystery|Thriller,2000,"scary, stupid as hell","scary, stupid as hell"


Now we are left with a cleaned and curated final Latest Movie Lens df with data relevant to our research question: 
- avg_rating (gauging interest/favorabiliy of a movie)
- title of movie
- genres (allows us to look at genres related to jo industries such as Crime, which is related to law enforecement majors)
- release_year (to look at different movies released that could have effected grad students and recent graduates career/major choices)
- relevant_tags (further sort for specific tags related to job industries: "crime", "police", "hospital", etc.).

# 583 Colleges Data Cleaning

In order to analyze the differences in movie genre ratings and proximity to job, we need to combine the grad df and recent_grad df. Our first step in tidying this data was to filter for data that was relevant. We removed cols pertaining to earnings, because we are not looking at earnings, but the type of jobs students get as graduates (recent_grads df) and students (grad df).

In [21]:
grad_filter = grad_583.filter(items = ["Major_code","Major","Major_category",
                                       "Grad_total", "Grad_sample_size",
                                       "Grad_employed", "Grad_full_time_year_round"])
# Keeping columns in items for grad df

recent_filter = recentgrads_583.filter(items = ["Major_code","Major","Total",
                                                "Major_category", "College_jobs",
                                                "Non_college_jobs","Low_wage_jobs"])
# Keeping columns in items for recent_grad df

Next we combined the filtered grad and recent_grad dfs into one df, joined on their common col Major. Irrelevant cols were dropped after the dfs were combined.

In [22]:
combined_majors = db.sql("SELECT * FROM grad_filter FULL JOIN recent_filter \
ON grad_filter.Major = recent_filter.Major").df()
# Joining grad and recent_grad df into one df on their common Major column

cm_drop = combined_majors.drop(columns = ['Major_code_2', "Major_2", 
                                          "Major_category_2", "Grad_sample_size", 
                                          "Major_code"])
# remove duplicate columns in the combined df

majors_final = cm_drop.rename({"Total":"recent_grad_total", "College_jobs":"rg_degree_jobs", 
                            "Non_college_jobs":"rg_ndegree_jobs", 
                               "Low_wage_jobs":"rg_low_wage_jobs"}, axis = 1).fillna(0)
# rename the columns in the combined college majors df

print(majors_final.head())
# Look at first five rows of the college majors df


print(majors_final['Major'].unique())
#This is to see which majors we are working with so that we can 
# further split up the tags based on which majors they might most closely relate to

                                    Major  \
0                   CONSTRUCTION SERVICES   
1                  HOSPITALITY MANAGEMENT   
2  COSMETOLOGY SERVICES AND CULINARY ARTS   
3              COMMUNICATION TECHNOLOGIES   
4                         COURT REPORTING   

                        Major_category  Grad_total  Grad_employed  \
0  Industrial Arts & Consumer Services        9173           7098   
1                             Business       24417          18368   
2  Industrial Arts & Consumer Services        5411           3590   
3              Computers & Mathematics        9109           7512   
4                  Law & Public Policy        1542           1008   

   Grad_full_time_year_round  recent_grad_total  rg_degree_jobs  \
0                       6511            18498.0            3275   
1                      14784            43647.0            2325   
2                       2701            10510.0             563   
3                       5622            18035.

We are now left with a df of Majors and relevant data pertaining to employment for grad students and graduates with/without jobs.

We want to look at the relationship between movies (number of, average ratings) with tags revelant to a specific major (category) and whether or not the number of graduates in that major has changed. \
First, we can split the movies dataset into two: one set will contain movies released before 2006 and the other will contain movies released between 2006-2012. \
The set of older movies would correspond to movies that came out during the childhood/college years of recent graduates and graduates already in the workforce by 2012. \
The set of newer movies would represent the same but for those still in college/graduate school. 

In [23]:
# creating movie df with movies released in and before 2006
movies_old = ml_final[ml_final.release_year <= 2006]
## creating movie df with movies released after 2006, which is relevant to our Major df 
## because the major data was collected from 2010 t0 2012
movies_new = ml_final[ml_final.release_year > 2006]

In [24]:
print(majors_final['Major_category'].unique())

['Industrial Arts & Consumer Services' 'Business'
 'Computers & Mathematics' 'Law & Public Policy'
 'Agriculture & Natural Resources' 'Communications & Journalism' 'Arts'
 'Engineering' 'Social Science' 'Health' 'Interdisciplinary'
 'Physical Sciences' 'Humanities & Liberal Arts'
 'Psychology & Social Work' 'Biology & Life Science' 'Education']


In [25]:
movies_old.fillna(0)

Unnamed: 0,avg_rating,title,genres,release_year,relevant_tags,relevan_tags
0,3.133129,"Yards, The",Crime|Drama,2000,"crime, political corruption","crime, political corruption"
1,2.865199,Next Friday,Comedy,2000,"funny, prison escape, funny as hell","funny, prison escape, funny as hell"
2,2.679353,Down to You,Comedy|Romance,2000,"teen, teen movie","teen, teen movie"
3,2.289954,Isn't She Great?,Comedy,2000,cancer,cancer
4,2.449003,Scream 3,Comedy|Horror|Mystery|Thriller,2000,"scary, stupid as hell","scary, stupid as hell"
...,...,...,...,...,...,...
49682,3.500000,Jologs,Comedy|Drama|Romance,2002,0,
49689,3.750000,Wallander 03 - The Brothers,Crime|Thriller,2005,0,
49700,5.000000,Mystic India,Documentary|Drama|Thriller,2005,0,
49704,3.000000,E-Dreams,Documentary,2001,0,


In [26]:
maj_sel = ['PRE-LAW AND LEGAL STUDIES', 'CRIMINAL JUSTICE AND FIRE PROTECTION', 
                 'COMPUTER SCIENCE' , 'MATHEMATICS', 'COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY', 
                  'MILITARY TECHNOLOGIES', 
                  'MASS MEDIA',
                  'CRIMINOLOGY', 'INTERNATIONAL RELATIONS', 'POLITICAL SCIENCE AND GOVERNMENT',
                  'ECOLOGY', 'GENETICS', 'MICROBIOLOGY',
                  'GENERAL EDUCATION',
                  'COMMERCIAL ART AND GRAPHIC DESIGN', 'MUSIC', 'STUDIO ARTS',
                  'ENGLISH LANGUAGE AND LITERATURE', 'HISTORY', 'PHILOSOPHY AND RELIGIOUS STUDIES',
                  'MINING AND MINERAL ENGINEERING', 'MECHANICAL ENGINEERING', 'ENVIRONMENTAL ENGINEERING',
                  'GENERAL BUSINESS',
                  'ASTRONOMY AND ASTROPHYSICS', 'NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES', 'MULTI-DISCIPLINARY OR GENERAL SCIENCE',
                  'TREATMENT THERAPY PROFESSIONS', 'HEALTH AND MEDICAL PREPARATORY PROGRAMS', 'GENERAL MEDICAL AND HEALTH SERVICES']

In [27]:
# Creating a series that has the relevant tags for each selected major based off of which movies 
# were from before 2006 and had the relevant tag
maj_tag_old = []

maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['lawyer', 'lawyers']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['death penalty', 'detective', 
                                                                              'private detective', 'fbi', 
                                                                              'cia', 'spy']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['computers', 
                                                                              'artificial intelligence']), 
                                                                    na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['math', 'mathematics']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['hackers', 'hacking']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['military', 'weapons']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['television']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['crime, crime gone awry']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['united nations',
                                                                              'terrorism', 'wartime']), 
                                                                    na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['political', 
                                                                              'political corruption',
                                                                              'politics']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['evolution', 'ecology', 
                                                                              'dinosaurs']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['genetics', 'clones']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['cancer']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['teacher', 'college']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['graphic design', 
                                                                              'graphic novel']), 
                                                                    na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['music', 'music business', 
                                                                              'musicians']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['stop-motion', 'stop motion', 
                                                                              'studio ghibli']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['allegory', 'fairy tale', 
                                                                              'fairy tales']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['history', 
                                                                              'bullshit history']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['metaphysics', 
                                                                              'good versus evil']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['mining']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['robot', 'robots', 
                                                                              'androids', 
                                                                              'android(s)/cyborgs']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['global warming', 
                                                                              'environment', 
                                                                              'environmental']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['business']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['alien, aliens', 
                                                                              'alien invasion', 
                                                                              'space', 'astronauts', 
                                                                              'nasa']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['nuclear', 'nuclear bomb', 
                                                                              'nuclear war']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['science', 'sci fi', 
                                                                              'scifi', 'science fiction']), 
                                                                    na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['mental health', 
                                                                              'mental hospital']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['doctors']), na = False)])
maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['hospital', 
                                                                              'terminal illness']), na = False)])

  maj_tag_old.append(movies_old[movies_old.relevant_tags.str.contains('|'.join(['robot', 'robots',


In [28]:
# Creating a series that has the relevant tags for each selected major based off of which movies 
# were from after 2006 and had the relevant tag
maj_tag_new = []

maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['lawyer', 'lawyers']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['death penalty', 'detective', 
                                                                              'private detective', 'fbi', 
                                                                              'cia', 'spy']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['computers', 
                                                                              'artificial intelligence']), 
                                                                              na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['math', 'mathematics']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['hackers', 'hacking']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['military', 'weapons']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['television']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['crime, crime gone awry']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['united nations','terrorism', 
                                                                              'wartime']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['political', 'political corruption',
                                                                              'politics']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['evolution', 'ecology', 
                                                                              'dinosaurs']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['genetics', 'clones']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['cancer']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['teacher', 'college']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['graphic design', 
                                                                              'graphic novel']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['music', 'music business', 
                                                                              'musicians']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['stop-motion', 'stop motion', 
                                                                              'studio ghibli']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['allegory', 'fairy tale', 
                                                                              'fairy tales']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['history', 'bullshit history']), 
                                                                    na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['metaphysics', 
                                                                              'good versus evil']), 
                                                                    na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['mining']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['robot', 'robots', 'androids', 
                                                                              'android(s)/cyborgs']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['global warming', 'environment', 
                                                                              'environmental']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['business']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['alien, aliens', 'alien invasion', 
                                                                              'space', 'astronauts', 
                                                                              'nasa']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['nuclear', 'nuclear bomb', 
                                                                              'nuclear war']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['science', 'sci fi', 'scifi', 
                                                                              'science fiction']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['mental health', 
                                                                              'mental hospital']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['doctors']), na = False)])
maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['hospital', 
                                                                              'terminal illness']), na = False)])

  maj_tag_new.append(movies_new[movies_new.relevant_tags.str.contains('|'.join(['robot', 'robots', 'androids',


Then for later use, we compute how many movies correspond to each major. 

In [29]:
count_old = np.zeros(len(maj_tag_old))
count_new = np.zeros(len(maj_tag_new))
avg_old = np.zeros(len(maj_tag_old))
avg_new = np.zeros(len(maj_tag_new))
for i in range(len(maj_tag_old)):
    count_old[i] = len(maj_tag_old[i]['title'])
    count_new[i] = len(maj_tag_new[i]['title'])
    avg_old[i] = maj_tag_old[i]['avg_rating'].mean()
    avg_new[i] = maj_tag_new[i]['avg_rating'].mean()
    
    
avg_old[np.isnan(avg_old)] = 0
avg_new[np.isnan(avg_new)] = 0

Now we look at the data regarding graduates. We calculated both count and proportion so we could look at how overall numbers changed through time but for comparison purposes decided it would be better to look at proportions, since outlooks on attending college and graduate school might have changed through the years. 

In [30]:
grad_tot = []
rec_grad_tot = []
for maj in maj_sel:
    row = majors_final.loc[majors_final.Major == maj]
    #grad_tot.append(row['Grad_total'])
    grad_tot.append(int(row['Grad_total']))
    rec_grad_tot.append(int(row['recent_grad_total']))
    
sum_grads = sum(grad_tot)
sum_recent = sum(rec_grad_tot)

prop_grad = np.array(grad_tot)/sum_grads
prop_recent = np.array(rec_grad_tot)/sum_recent

  grad_tot.append(int(row['Grad_total']))
  rec_grad_tot.append(int(row['recent_grad_total']))


Now we can calculate the changes in proportions of recent graduates and current graduates based on their majors. For later comparisons, we also look at how the proportions of movies changed regarding tags.

In [31]:
prop_old = count_old/sum(count_old)
prop_new = count_new/sum(count_new)
diff_prop = prop_grad-prop_recent
diff_mov = prop_new-prop_old

In [32]:
old_ratings=[]
for maj in maj_tag_old:
    old_ratings.append(maj['avg_rating'].mean())
old_ratings=pd.Series(old_ratings).fillna(0).tolist()
new_ratings=[]
for maj in maj_tag_new:
    new_ratings.append(maj['avg_rating'].mean())
new_ratings=pd.Series(new_ratings).fillna(0).tolist()

for i in range(len(maj_tag_old)):
    if old_ratings[i]==0:
        new_ratings[i]=0
    elif new_ratings[i]==0:
        old_ratings[i]=0


Now all this can be consolidated into a dataframe. 

In [33]:
maj_mov_data = {'major': maj_sel, 'movies_old': count_old, 'movies_old_prop': prop_old,
                            'movies_new': count_new, 'movies_new_prop': prop_new,
                            'old_ratings': old_ratings, 'new_ratings': new_ratings,
                            'old_grads': rec_grad_tot, 'old_grads_prop': prop_recent, 
                            'current_grads': grad_tot, 'current_grads_prop': prop_grad}
maj_mov_df = pd.DataFrame(maj_mov_data)
maj_mov_df['movies_prop_diff'] = maj_mov_df['movies_new_prop'] - maj_mov_df['movies_old_prop']
maj_mov_df['grad_prop_diff'] = maj_mov_df['current_grads_prop'] - maj_mov_df['old_grads_prop']
maj_mov_df

Unnamed: 0,major,movies_old,movies_old_prop,movies_new,movies_new_prop,old_ratings,new_ratings,old_grads,old_grads_prop,current_grads,current_grads_prop,movies_prop_diff,grad_prop_diff
0,PRE-LAW AND LEGAL STUDIES,15.0,0.018337,29.0,0.01846,3.310907,3.547766,13528,0.007141,33137,0.004992,0.000122,-0.002149
1,CRIMINAL JUSTICE AND FIRE PROTECTION,112.0,0.136919,209.0,0.133036,3.392238,3.42052,152824,0.080669,188228,0.028353,-0.003883,-0.052316
2,COMPUTER SCIENCE,6.0,0.007335,21.0,0.013367,3.407228,3.438356,128319,0.067734,324402,0.048866,0.006032,-0.018869
3,MATHEMATICS,4.0,0.00489,9.0,0.005729,3.668549,3.692159,72397,0.038215,418056,0.062973,0.000839,0.024758
4,COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY,3.0,0.003667,14.0,0.008912,3.026077,3.36206,8066,0.004258,10290,0.00155,0.005244,-0.002708
5,MILITARY TECHNOLOGIES,34.0,0.041565,61.0,0.038829,3.252002,3.294188,124,6.5e-05,3465,0.000522,-0.002736,0.000456
6,MASS MEDIA,2.0,0.002445,0.0,0.0,0.0,0.0,52824,0.027884,42915,0.006464,-0.002445,-0.021419
7,CRIMINOLOGY,15.0,0.018337,22.0,0.014004,3.527668,3.505979,19879,0.010493,18499,0.002787,-0.004334,-0.007707
8,INTERNATIONAL RELATIONS,51.0,0.062347,81.0,0.05156,3.493779,3.503813,28187,0.014879,69355,0.010447,-0.010788,-0.004432
9,POLITICAL SCIENCE AND GOVERNMENT,88.0,0.107579,151.0,0.096117,3.619341,3.55868,182621,0.096398,695725,0.104799,-0.011462,0.008401


In [34]:
# Making empty df
majmovietags = pd.DataFrame(columns = ['avg_rating', 'title', 'genres', 'release_year', 'relevant_tags', 'relevan_tags'
                                       'primary_genre', 'major']) 

# Creating dfs for all movies related to a specific major through relevant tags and adding a 'major' column
# so we know what major a movie is related to
pre_law = movies_new[movies_new.relevant_tags.str.contains('|'.join(['lawyer', 'lawyers']), na = False)]
pre_law['major'] = 'PRE-LAW AND LEGAL STUDIES'
criminal_j = movies_new[movies_new.relevant_tags.str.contains('|'.join(['death penalty', 'detective', 
                                                                              'private detective', 'fbi', 
                                                                              'cia', 'spy']), na = False)]
criminal_j['major'] = 'CRIMINAL JUSTICE AND FIRE PROTECTION'
cs = movies_new[movies_new.relevant_tags.str.contains('|'.join(['computers', 
                                                                'artificial intelligence']), 
                                                                    na = False)]
cs['major'] = 'COMPUTER SCIENCE'
math = movies_new[movies_new.relevant_tags.str.contains('|'.join(['math', 'mathematics']), na = False)]
math['major'] = 'MATHEMATICS'
cit = movies_new[movies_new.relevant_tags.str.contains('|'.join(['hackers', 'hacking']), na = False)]
cit['major'] = 'COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY'
military = movies_new[movies_new.relevant_tags.str.contains('|'.join(['military', 'weapons']), na = False)]
military['major'] = 'MILITARY TECHNOLOGIES'
media = movies_new[movies_new.relevant_tags.str.contains('|'.join(['television']), na = False)]
media['major'] = 'MASS MEDIA'
criminology = movies_new[movies_new.relevant_tags.str.contains('|'.join(['crime, crime gone awry']), na = False)]
criminology['major'] = 'CRIMINOLOGY'
interna_rela = movies_new[movies_new.relevant_tags.str.contains('|'.join(['united nations','terrorism', 
                                                                              'wartime']), na = False)]
interna_rela['major'] = 'INTERNATIONAL RELATIONS'
polisci = movies_new[movies_new.relevant_tags.str.contains('|'.join(['political', 'political corruption',
                                                                              'politics']), na = False)]
polisci['major'] = 'POLITICAL SCIENCE AND GOVERNMENT'
ecology = movies_new[movies_new.relevant_tags.str.contains('|'.join(['evolution', 'ecology', 
                                                                              'dinosaurs']), na = False)]
ecology['major'] = 'ECOLOGY'
genetic = movies_new[movies_new.relevant_tags.str.contains('|'.join(['genetics', 'clones']), na = False)]
genetic['major'] = 'GENETICS'
microbio = movies_new[movies_new.relevant_tags.str.contains('|'.join(['cancer']), na = False)]
microbio['major'] = 'MICROBIOLOGY'
gened = movies_new[movies_new.relevant_tags.str.contains('|'.join(['teacher', 'college']), na = False)]
gened['major'] = 'GENERAL EDUCATION'
graphic_design = movies_new[movies_new.relevant_tags.str.contains('|'.join(['graphic design', 
                                                                              'graphic novel']), na = False)]
graphic_design['major'] = 'COMMERCIAL ART AND GRAPHIC DESIGN'
music = movies_new[movies_new.relevant_tags.str.contains('|'.join(['music', 'music business', 
                                                                              'musicians']), na = False)]
music['major'] = 'MUSIC'
art = movies_new[movies_new.relevant_tags.str.contains('|'.join(['stop-motion', 'stop motion', 
                                                                              'studio ghibli']), na = False)]
art['major'] = 'STUDIO ARTS'
eng = movies_new[movies_new.relevant_tags.str.contains('|'.join(['allegory', 'fairy tale', 
                                                                              'fairy tales']), na = False)]
eng['major'] = 'ENGLISH LANGUAGE AND LITERATURE'
history = movies_new[movies_new.relevant_tags.str.contains('|'.join(['history', 'bullshit history']), 
                                                                    na = False)]
history['major'] = 'HISTORY'
philo = movies_new[movies_new.relevant_tags.str.contains('|'.join(['metaphysics', 
                                                                    'good versus evil']), 
                                                                    na = False)]
philo['major'] = 'PHILOSOPHY AND RELIGIOUS STUDIES'
mining = movies_new[movies_new.relevant_tags.str.contains('|'.join(['mining']), na = False)]
mining['major'] = 'MINING AND MINERAL ENGINEERING'
meche = movies_new[movies_new.relevant_tags.str.contains('|'.join(['robot', 'robots', 'androids', 
                                                                              'android(s)/cyborgs']), na = False)]
meche['major'] = 'MECHANICAL ENGINEERING'
env_eng = movies_new[movies_new.relevant_tags.str.contains('|'.join(['global warming', 'environment', 
                                                                              'environmental']), na = False)]
env_eng['major'] = 'ENVIRONMENTAL ENGINEERING'
business = movies_new[movies_new.relevant_tags.str.contains('|'.join(['business']), na = False)]
business['major'] = 'GENERAL BUSINESS'
astro = movies_new[movies_new.relevant_tags.str.contains('|'.join(['alien, aliens', 'alien invasion', 
                                                                              'space', 'astronauts', 
                                                                              'nasa']), na = False)]
astro['major'] = 'ASTRONOMY AND ASTROPHYSICS'
nuclear = movies_new[movies_new.relevant_tags.str.contains('|'.join(['nuclear', 'nuclear bomb', 
                                                                              'nuclear war']), na = False)]
nuclear['major'] = 'NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES'
gen_sci = movies_new[movies_new.relevant_tags.str.contains('|'.join(['science', 'sci fi', 'scifi', 
                                                                              'science fiction']), na = False)]
gen_sci['major'] = 'MULTI-DISCIPLINARY OR GENERAL SCIENCE'
therapy = movies_new[movies_new.relevant_tags.str.contains('|'.join(['mental health', 
                                                                              'mental hospital']), na = False)]
therapy['major'] = 'TREATMENT THERAPY PROFESSIONS'
med = movies_new[movies_new.relevant_tags.str.contains('|'.join(['doctors']), na = False)]
med['major'] = 'HEALTH AND MEDICAL PREPARATORY PROGRAMS'
gen_med = movies_new[movies_new.relevant_tags.str.contains('|'.join(['hospital', 
                                                                              'terminal illness']), na = False)]
gen_med['major'] = 'GENERAL MEDICAL AND HEALTH SERVICES'

# concatenating all of the individual major>movie dataframes together on rows
majmovietags_final = pd.concat([pre_law, criminal_j,gen_med,med,therapy,gen_sci,nuclear,astro,business,env_eng,meche,
           mining,philo,history,eng,art,music,graphic_design,gened,microbio,genetic,
           genetic,ecology,polisci,interna_rela,criminology,media,military,cit,math,cs], axis = 0)

# Finding primary genre for each movie
majmovietags_final['genres'] = majmovietags_final.genres.str.split('|')
majmovietags_final = majmovietags_final.explode('genres')
subset_col = ['title']
primary = majmovietags_final.drop_duplicates(subset=subset_col)
majmovietags_final['primary_genre'] = primary['genres'].rename({"genre":"primary_genre"})
majmovietags_final['primary_genre'] = majmovietags_final.primary_genre.replace({"Action":1,"Adventure":2,\
                                                "Animation":3,"Children":4, "Comedy":5,\
                                          "Crime":6, "Documentary":7, "Drama":8, "Fantasy":9,\
                                          "Film-Noir":10, "Horror":11, "Musical":12, "Mystery":13,\
                                          "Romance":14, "Sci-Fi":15, "Thriller":16, "War":17, "Western":18,\
                                          "(no genres listed)": 0})

# Getting dummies for 'major' and 'primary_genre' columns
majmovietags_final = pd.get_dummies(majmovietags_final, columns = ['major'])
majmovietags_final = majmovietags_final.replace({True:1, False:0})

#removing redundant col
majmovietags_final = majmovietags_final.drop(columns = ['relevan_tags', 'genres'])

#renaming columns
majmovietags_final = majmovietags_final.rename(columns={
                'major_PRE-LAW AND LEGAL STUDIES':'PRE-LAW AND LEGAL STUDIES' , 
                    'major_CRIMINAL JUSTICE AND FIRE PROTECTION':'CRIMINAL JUSTICE AND FIRE PROTECTION', \
                 'major_COMPUTER SCIENCE':'COMPUTER SCIENCE' , 'major_MATHEMATICS':'MATHEMATICS', \
'major_COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY':'COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY', \
                  'major_MILITARY TECHNOLOGIES':'MILITARY TECHNOLOGIES', \
                  'major_CRIMINOLOGY':'CRIMINOLOGY', 'major_INTERNATIONAL RELATIONS':'INTERNATIONAL RELATIONS',\
                'major_POLITICAL SCIENCE AND GOVERNMENT': 'POLITICAL SCIENCE AND GOVERNMENT', \
                  'major_ECOLOGY':'ECOLOGY', 'major_GENETICS':'GENETICS',\
                    'major_MICROBIOLOGY':'MICROBIOLOGY', \
                  'major_GENERAL EDUCATION':'GENERAL_EDUCATION', \
                  'major_COMMERCIAL ART AND GRAPHIC DESIGN':'COMMERCIAL ART AND GRAPHIC DESIGN',\
                    'major_MUSIC':'MUSIC', 'major_STUDIO ARTS':'STUDIO ARTS', \
                  'major_ENGLISH LANGUAGE AND LITERATURE':'ENGLISH LANGUAGE AND LITERATURE', \
                'major_PHILOSOPHY AND RELIGIOUS STUDIES':'PHILOSOPHY AND RELIGIOUS STUDIES',\
                  'major_MINING AND MINERAL ENGINEERING':'MINING AND MINERAL ENGINEERING',
                            'major_MECHANICAL ENGINEERING':'MECHANICAL ENGINEERING', \
                              'major_ENVIRONMENTAL ENGINEERING':'ENVIRONMENTAL ENGINEERING',\
                  'major_GENERAL BUSINESS':'GENERAL BUSINESS',\
                                    'major_ASTRONOMY AND ASTROPHYSICS':'ASTRONOMY AND ASTROPHYSICS', \
'major_NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES': \
                                    'NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES', \
           'major_MULTI-DISCIPLINARY OR GENERAL SCIENCE':'MULTI-DISCIPLINARY OR GENERAL SCIENCE',\
            'major_HEALTH AND MEDICAL PREPARATORY PROGRAMS':'HEALTH AND MEDICAL PREPATORY PROGRAMS', \
           'major_GENERAL MEDICAL AND HEALTH SERVICES':'GENERAL MEDICAL AND HEALTH SERVICES'})


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pre_law['major'] = 'PRE-LAW AND LEGAL STUDIES'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  criminal_j['major'] = 'CRIMINAL JUSTICE AND FIRE PROTECTION'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cs['major'] = 'COMPUTER SCIENCE'
A value is trying to be set on a copy of a slice from a DataFram

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  astro['major'] = 'ASTRONOMY AND ASTROPHYSICS'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  nuclear['major'] = 'NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES'
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gen_sci['major'] = 'MULTI-DISCIPLINARY OR GENERAL SCIENCE'
A value is trying to

In [35]:
maj_mov_df.to_csv('majors_movies.csv')
movies_new.to_csv('new_movies.csv')
movies_old.to_csv('old_movies.csv')
majmovietags_final.to_csv('majmovietags_final.csv')