# What is data wrangling all about?

Data wrangling with programming languages such as Python and R serves two main purposes:

1. **Cleaning data**, which involves things like removing leading and trailing spaces.
2. **Reshaping data**, which is what we do when the data contains all the information we need, but is presented in a way that does not fit our use case.

The brief examples that follow use a Python tool called *pandas*. This is not an extensive tutorial; rather, it is intended to give you a taste of how quickly data can be transformed when the appropriate tools are employed. If the examples whet your appetite, I would encourage you to pursue further training in either the *pandas* tool for Python (https://pandas.pydata.org) or the *tidyverse* tools for the R programming language, especially *dplyr* (https://www.tidyverse.org).

___

Across this demonstration we'll be looking at how to reshape and clean a dataset of movies, genres and actors. Let's get started!

The following piece of code imports the *pandas* tool into our Jupyter Notebook, which is the name of the programming environment with which we are working:

In [None]:
import pandas as pd

Next we import the data we'll be working with, and assign it the name `df`. The name could be anything, but `df` stands for DataFrame, which is the name of the tabular structures used in *pandas*. You can think of a DataFrame as somewhat equivalent to a single sheet in Excel.
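To make the idea of a DataFrame a little more concrete, here is a minimal sketch that builds a tiny table by hand. The name `toy_df` and the rows in it are invented purely for illustration and are not taken from the movie dataset:

```python
import pandas as pd

# Made-up rows for illustration only; these are not from the real dataset
toy_df = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B"],
    "actor_1": ["Actor X", "Actor Y"],
    "genres": ["Action|Comedy", "Drama"],
})

# shape is a pair: (number of rows, number of columns)
print(toy_df.shape)

# columns holds the column names, much like a header row in a spreadsheet
print(list(toy_df.columns))
```

Each key in the dictionary becomes a column, and each list supplies that column's values from top to bottom.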

The first thing we should do is check that this new table is exactly three times as long as before, since the three *actors* columns were essentially stacked one on top of the other.

We can check this with the following:

In [None]:
melted_df.shape

In [None]:
# Divide the row count of the new table by the row count of the old one
melted_df.shape[0] / df.shape[0]

Next we will remove the *variable* column, since this does not offer useful information about the actors:

In [None]:
melted_df.drop('variable', inplace=True, axis='columns')

In [None]:
melted_df.head()

Below we list the actors in the new column, alongside a count of how many times they appear (i.e. how many movies they acted in). For instance, Robert De Niro has appeared in 54 movies:

In [None]:
melted_df.actors.value_counts(dropna=False)[:20]

In the list above, the third value is `NaN`, with a count of 43. `NaN` stands for *Not a Number*, but what it really means is that the field was empty. We can see examples of this below:

In [None]:
melted_df[melted_df.actors.isnull()].head()
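If it helps to see the same idea on a smaller scale, the sketch below counts values and filters for empty fields in a made-up three-row table. The name `toy` and its contents are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Invented data: the second row has no actor recorded
toy = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "actors": ["Actor X", np.nan, "Actor X"],
})

# dropna=False keeps the empty entries in the tally instead of ignoring them
counts = toy.actors.value_counts(dropna=False)
print(counts)

# isnull() gives True for each empty entry, so it can be used to filter rows
missing_rows = toy[toy.actors.isnull()]
print(missing_rows)
```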

These rows with no actors listed are a side effect of the original table with *actor_1*, *actor_2* and *actor_3*. At times there were fewer than three actors listed, in which case one or more of those columns was left empty.

In our new table, however, these rows are no longer required: they simply duplicate movies and their corresponding genres without listing any new actors. They can be removed as follows:

In [None]:
'  The Good Dinosaur      '.strip()

This also works for the non-breaking spaces shown above:

In [None]:
'The Good Dinosaur\xa0'.strip()
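The reason this works is that Python counts the non-breaking space as whitespace, and `strip` called with no arguments removes whitespace from both ends of a string. A small sketch, using an invented `title` variable, shows this with `repr`, which prints otherwise invisible characters explicitly:

```python
# A made-up string with a non-breaking space at each end
title = "\xa0 The Good Dinosaur\xa0"

# repr() shows the non-breaking space explicitly as \xa0
print(repr(title))

# strip() with no arguments removes whitespace at either end of the string
cleaned = title.strip()
print(repr(cleaned))

# Spaces inside the string are left alone; only the ends are trimmed
print(repr("The  Good Dinosaur ".strip()))
```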

In order to tidy our actors dataset, we need to visit every movie for each actor, and remove superfluous spaces.

*pandas* makes this relatively painless, although it is complicated a little by the fact that in a given row, we have multiple movies in the *movie_titles* column:

In [None]:
combined['movie_titles'] = combined.movie_titles.apply(lambda x: [el.strip() for el in x])
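The one-liner above can be easier to follow when the `lambda` is written out as a named function. The sketch below does this on a made-up single-row table; `toy` and `strip_all` are hypothetical names chosen for illustration:

```python
import pandas as pd

# Invented data: one actor whose titles carry stray spaces
toy = pd.DataFrame({
    "actors": ["Actor X"],
    "movie_titles": [["Movie A\xa0", " Movie B"]],
})

def strip_all(titles):
    """Return a new list with whitespace stripped from every title."""
    return [title.strip() for title in titles]

# apply() calls strip_all once per row, passing in that row's list of titles
toy["movie_titles"] = toy.movie_titles.apply(strip_all)
print(toy.movie_titles.iloc[0])
```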

Now if we look again at the movies in which A.J. Buckley acted, we see that those sneaky little spaces are no longer there:

In [None]:
for val in combined[combined.actors=='A.J. Buckley'].movie_titles.values:
    print(val)

Although the *actors* and *genres* columns look OK, we can repeat the process of stripping spaces with them just to be sure. 
This is a little more complicated for the *genres* column because it contains multiple values:

In [None]:
combined['actors'] = combined.actors.str.strip()

In [None]:
combined['genres'] = combined.genres.apply(lambda genres: {genre.strip() for genre in genres} )

In [None]:
combined.head()

Finally, we may want our movies and genres to be sorted in alphabetical order. It is important that this is done **after** removing extra spaces:

In [None]:
melted_df = pd.melt(df, id_vars=["movie_title", "genres"]  , value_name="actors")

Below are the first five rows of our new table. Note that there is now a single column for actors, and another column that specifies whether the actor was numbered one, two or three in the original table.

In [None]:
melted_df.dropna(subset=['actors'], inplace=True)

If we look again at the movie count for different actors you will notice that the `NaN` values are gone:

In [None]:
melted_df.actors.value_counts()[:20]
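As a rough illustration of what `dropna` with the `subset` argument does, the sketch below uses a made-up table named `toy`. Unlike the notebook code, it leaves out `inplace=True` and stores the result under a new name, `kept`, so the original stays available for comparison:

```python
import numpy as np
import pandas as pd

# Invented data with one missing actor and one missing title
toy = pd.DataFrame({
    "movie_title": ["Movie A", "Movie A", None],
    "actors": ["Actor X", np.nan, "Actor Y"],
})

# Only the actors column is inspected, so the row with a missing title survives
kept = toy.dropna(subset=["actors"])
print(kept)
```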

Here again is a reminder of what our table looks like currently:

The next step is to group the data such that we have a single row for each unique actor. Within this row all the movies they have acted in will appear in a list:

In [None]:
actors_movies = melted_df.groupby('actors').movie_title.apply(list).to_frame().reset_index()
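For the curious, the chained call can be read as three separate steps. The sketch below unpacks it on a made-up table; the names `toy`, `titles_per_actor`, `as_table` and `result` are assumptions introduced only for this illustration:

```python
import pandas as pd

# Invented data: Actor X appears in two movies, Actor Y in one
toy = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie A"],
    "actors": ["Actor X", "Actor X", "Actor Y"],
})

# Step 1: group the rows by actor and collect each group's titles into a list
titles_per_actor = toy.groupby("actors").movie_title.apply(list)

# Step 2: turn the result into a one-column table
as_table = titles_per_actor.to_frame()

# Step 3: move the actor names out of the index into an ordinary column
result = as_table.reset_index()
print(result)
```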

A lot happened in the above bit of code. It employs some advanced data wrangling concepts that are beyond the scope of this intro. 

For now, note simply that we obtained the result we were after with only a small amount of code.

In the table below, each actor has a row, while all of the movies they have acted in appear in a list:

In [None]:
combined[combined.actors=='A.J. Buckley'].movie_titles

If we look very closely above, we can see that there is a space between each movie title and the comma that follows.

As it happens, this is a special kind of space called a non-breaking space (https://en.wikipedia.org/wiki/Non-breaking_space). In a string of text it is often represented as `\xa0`.
We can see this explicitly below:

Given a string of text with leading and/or trailing spaces, such as `"  The Good Dinosaur      "`, the superfluous spaces can be removed with Python's `strip` method:

In [None]:
df = pd.read_csv('movie_reduced.csv')

After importing the data, we show the first five rows below. Consider the dataset. Have you ever seen something similar? 

It contains a list of movies, three columns naming people who acted in each movie, and a list of genres describing the movie, with all of the genres packed into a single column.

This is perhaps OK if we want to know about a particular movie, but what if we are more interested in individual genres or actors?

In [None]:
df.head()

In the example that follows, we are going to imagine that we want to make a Collection in Omeka that is all about the actors. For a given actor, we will have a list of all of the movies that they have acted in and the genres associated with those movies. Eventually, you could build on this approach to have an Omeka database with separate collections for Movies, Actors and Genres, with links connecting them together.

The main problem with the dataset above is that all of the actors are listed in separate columns. Note too that not all movies have three actors listed.

The first and most important step in transforming this data is to put all of the actors in a single column.
For our purposes, this is **not** done by concatenating the columns together, as has been done with *genres*.

Instead, we can imagine that the three actor columns are stacked vertically, one on top of the other, with the *movie_title* and *genres* duplicated as required.

Currently, our table has around 5000 rows:

In [None]:
df.shape

Once the actors are placed into a single column, this will triple.

To perform this transformation, we can use a *pandas* function called `melt`, which can melt down multiple columns into a single column by stacking them one on top of the other.
For this to work, we just need to specify those columns that we don't want to melt down: in this case those are the *movie_title* and *genres* columns.
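To see the stacking on a scale small enough to read in full, here is a minimal sketch of `melt` on a made-up two-movie table. The table `toy` and its contents are invented for illustration; the column names mirror those described above:

```python
import pandas as pd

# Invented data: two movies, with some actor slots left empty
toy = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B"],
    "genres": ["Action|Comedy", "Drama"],
    "actor_1": ["Actor X", "Actor Y"],
    "actor_2": ["Actor Z", None],
    "actor_3": [None, None],
})

# Columns named in id_vars are kept and repeated; all others are stacked
melted = pd.melt(toy, id_vars=["movie_title", "genres"], value_name="actors")
print(melted)

# Three actor columns were stacked, so there are three times as many rows
print(melted.shape[0] / toy.shape[0])  # 3.0
```

Alongside the new *actors* column, `melt` adds a column called *variable* that records which original column each value came from.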

In [None]:
actors_movies.head(10)

If we want to see the movies for a single actor, we can do so as follows:

In [None]:
for movie in actors_movies[actors_movies.actors=='50 Cent'].movie_title.iloc[0]:
    print(movie)

Let's return to our earlier table as shown below:

We now want to get a list of the genres associated with each actor and can do so in a similar manner to above.

A first step here is to split the *genres* column into a list of genres, instead of a single string of text, as is currently the case.

Below we tell pandas to split this column wherever the pipe `|` character appears:

In [None]:
melted_df['genres'] = melted_df.genres.str.split('|')
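As a small illustration of what the split produces, the sketch below applies it to two made-up genre strings held in a hypothetical `genres` Series:

```python
import pandas as pd

# Invented genre strings in the same pipe-separated style
genres = pd.Series(["Action|Comedy", "Drama"])

# Each string becomes a list; a string with no pipe becomes a one-item list
split_genres = genres.str.split("|")
print(split_genres.tolist())  # [['Action', 'Comedy'], ['Drama']]
```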

We can now create a table of actors and the genres they have been associated with.

Again, a deep exploration of the code used here is beyond the scope of this data wrangling taster.

However, it serves to give you a sense of the kind of data superpowers you can look forward to if you commit to learning more about data wrangling, be it with Python, R, or some other tool.

In [None]:
from itertools import chain
actors_genres = melted_df.groupby('actors').genres.apply(lambda x: set(chain.from_iterable(x.values))).to_frame().reset_index()
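The heart of that line is the combination of `chain.from_iterable` and `set`, which can be tried out without pandas at all. The sketch below uses a made-up `genre_lists` value standing in for the genre lists of one actor's movies:

```python
from itertools import chain

# Invented data: the genre lists of two movies by the same actor
genre_lists = [["Action", "Comedy"], ["Comedy", "Drama"]]

# chain.from_iterable joins the inner lists end to end into one sequence
flat = list(chain.from_iterable(genre_lists))
print(flat)  # ['Action', 'Comedy', 'Comedy', 'Drama']

# set() then discards the duplicates (sets have no fixed order)
unique = set(flat)
print(unique)
```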

In [None]:
actors_genres.head(20)

To confirm that this all worked, let's take a look at the movies of 'AJ Michalka', listed in the row marked `5` above. We can see that the genres correspond exactly, with duplicates removed in the `actors_genres` table:

In [None]:
melted_df[melted_df.actors == 'AJ Michalka']

After all this we now have two new tables, one with actors and genres, and another with actors and movies. They both have 6255 rows (one row for each actor) and two columns. 

We can check this with the `shape` attribute of each table:

In [None]:
combined['movie_titles'] = combined.movie_titles.apply(sorted)
combined['genres'] = combined.genres.apply(sorted)
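Sorting is best left until after stray spaces have been removed, because a space sorts ahead of every letter. The following sketch, with an invented `titles` list, shows how a leading space can push a title to the wrong end of the list:

```python
# Invented titles: the first has a stray leading space
titles = [" Movie Z", "Movie A"]

# Sorted too early, the space puts " Movie Z" ahead of "Movie A"
sorted_too_early = sorted(titles)
print(sorted_too_early)  # [' Movie Z', 'Movie A']

# Stripping first gives the expected alphabetical order
sorted_after_strip = sorted(title.strip() for title in titles)
print(sorted_after_strip)  # ['Movie A', 'Movie Z']
```

Note that `sorted` always returns a list, so applying it to the *genres* column turns each set of genres into a sorted list.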

This gives the following:

In [None]:
combined.head(20)

There is more we could do with this dataset. For instance, it doesn't seem right that we have both AJ and A.J. used for initials. If you're up for a challenge, you could investigate the pandas `str.replace` method to see if you can work out how we could remove full stops from initials: https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.str.replace.html

Having come to the end of this data wrangling taster, the very last step is to export our data as a *csv* file for use elsewhere:

In [None]:
combined.to_csv('combined.csv')

___ 

# Coda

The main code used across the whole of this notebook is shown below, with comments preceded by the `#` symbol. We accomplished a lot in only fifteen lines of code!

In [None]:
###### IMPORTING DATA ######

#1 Import pandas 
import pandas as pd

#2 Read our data from a csv into a table named df
df = pd.read_csv('movie_reduced.csv')



###### DATA WRANGLING ######

#3 Stack our actor_1, actor_2 and actor_3 columns on top of each other with the pandas melt function
melted_df = pd.melt(df, id_vars=["movie_title", "genres"]  , value_name="actors")

#4 Split up the genres that are separated with the "|" character
melted_df['genres'] = melted_df.genres.str.split('|')

#5 Create a table with a row for each actor and a list of the movies they acted in
actors_movies = melted_df.groupby('actors').movie_title.apply(list).to_frame().reset_index()

#6 Import a tool for connecting lists of items (used in the next step)
from itertools import chain

#7 Create a table with a row for each actor and a list of the genres they have been part of
actors_genres = melted_df.groupby('actors').genres.apply(lambda x: set(chain.from_iterable(x.values))).to_frame().reset_index()

#8 Combine the movies and genres tables created above
combined = pd.merge(left=actors_movies, right=actors_genres, on="actors")

#9 Rename the columns in the combined table (changing movie_title to movie_titles)
combined.columns = ['actors', 'movie_titles', 'genres']



###### REMOVING SPACES ######

#10 Remove any leading or trailing spaces from every movie in the movie_titles column
combined['movie_titles'] = combined.movie_titles.apply(lambda x: [el.strip() for el in x])

#11 Remove any leading or trailing spaces from every actor in the actors column
combined['actors'] = combined.actors.str.strip()

#12 Remove any leading or trailing spaces from every genre in the genres column
combined['genres'] = combined.genres.apply(lambda genres: {genre.strip() for genre in genres} )



###### SORTING ######

#13 Sort the movies in the movie_titles column
combined['movie_titles'] = combined.movie_titles.apply(sorted)

#14 Sort the genres in the genres column
combined['genres'] = combined.genres.apply(sorted)



###### EXPORT ######

#15 Export our table to csv 
combined.to_csv('combined.csv')

Let's run all this code together and see where we began and where we ended up below:

In [None]:
actors_genres.shape

In [None]:
actors_movies.shape

The last step is to join these tables on the `actors` column, which appears in both:

In [None]:
combined = pd.merge(left=actors_movies, right=actors_genres, on="actors")
combined.columns = ['actors', 'movie_titles', 'genres']
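To illustrate what `merge` is doing here, the sketch below joins two made-up tables that share an `actors` column. The names `movies`, `genres` and `merged`, along with their contents, are assumptions for illustration only:

```python
import pandas as pd

# Invented tables: each has one row per actor
movies = pd.DataFrame({
    "actors": ["Actor X", "Actor Y"],
    "movie_title": [["Movie A", "Movie B"], ["Movie A"]],
})
genres = pd.DataFrame({
    "actors": ["Actor X", "Actor Y"],
    "genres": [{"Action", "Comedy"}, {"Action"}],
})

# Rows with the same value in the actors column are matched up side by side
merged = pd.merge(left=movies, right=genres, on="actors")
print(merged)
```

The shared column appears only once in the result, followed by the remaining columns from the left table and then the right.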

# Data cleaning

Happy as we might be with all the work we did above, there are still a few issues with the data that need attention.

Let's look at some of the movies in which A.J. Buckley acted.