---
<center><h1>Lesson 2 - Basic intro into pandas</h1></center> 
---
---

<center><h2>Part 5. Work with pandas DataFrames: join, merge and concatenate</h2></center>

---

## Table of Contents

- [Work with pandas DataFrames: join, merge and concatenate](#Work-with-pandas-DataFrames:-join,-merge-and-concatenate)
    * [Concatenation](#Concatenation)
    * [Merging and joining](#Merging-and-joining)
    - [*Exercise 5.1*](#Exercise-5.1)

In [1]:
import pandas as pd
import numpy as np
import random

## Work with pandas DataFrames: join, merge and concatenate

[[back to top]](#Table-of-Contents)

Now we will look at the operations that work on two or more DataFrames at once: merging, joining and concatenation. We will also see how to add new rows to a DataFrame, much like adding elements to a Python list with its `.append()` method.

### Concatenation

[[back to top]](#Table-of-Contents)

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations. We can use the `concat` function in pandas to append either columns or rows from one DataFrame to another. Let’s grab three subsets of `movies` to see how this works.

In [3]:
movies = pd.read_csv('data/movies.csv', encoding="ISO-8859-1")

In [4]:
mov1, mov2, mov3 = movies.loc[:33333], movies.loc[33334:66666], movies.loc[66667:, :]
mov2.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
33334,10,50,5,877888545,53.0,M,lawyer,90703,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33335,201,50,4,884114471,27.0,M,writer,E2A4H,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33336,287,50,5,875334271,21.0,M,salesman,31211,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33337,246,50,5,884920788,19.0,M,student,28734,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33338,249,50,4,879571695,25.0,M,student,84103,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33339,99,50,5,885679998,20.0,M,student,63129,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33340,178,50,5,882823857,26.0,M,other,49512,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33341,251,50,5,886272086,28.0,M,doctor,85032,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33342,25,50,5,885852150,39.0,M,,55107,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33343,59,50,5,888205087,49.0,M,educator,08403,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0


Can you name the row indexes of `movies` that `mov2` and `mov3` contain? 
Let’s join `mov2` and `mov3` into one DataFrame

In [5]:
mov23_0 = pd.concat([mov2, mov3], axis=0)
# or mov23_0 = pd.concat([mov2, mov3])
mov23_0

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
33334,10,50,5,877888545,53.0,M,lawyer,90703,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33335,201,50,4,884114471,27.0,M,writer,E2A4H,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33336,287,50,5,875334271,21.0,M,salesman,31211,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33337,246,50,5,884920788,19.0,M,student,28734,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33338,249,50,4,879571695,25.0,M,student,84103,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33339,99,50,5,885679998,20.0,M,student,63129,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33340,178,50,5,882823857,26.0,M,other,49512,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33341,251,50,5,886272086,28.0,M,doctor,85032,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33342,25,50,5,885852150,39.0,M,,55107,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33343,59,50,5,888205087,49.0,M,educator,08403,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0


When we concatenate DataFrames, we can specify the axis (the default is `axis=0`). `axis=0` tells pandas to stack the second DataFrame under the first one. It will automatically detect whether the column names are the same and will stack accordingly. `axis=1` will place the columns of the second DataFrame to the right of the first DataFrame. To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets. When we stack horizontally, we want to make sure what we are doing makes sense (i.e. the data are related in some way).
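
The two axes can be compared on a toy example. This is a minimal sketch: the frames `top`, `bottom` and `extra` and their columns are hypothetical and are not part of the `movies` data.

```python
import pandas as pd

# Two tiny, hypothetical frames with the same columns
top = pd.DataFrame({"title": ["A", "B"], "rating": [5, 3]})
bottom = pd.DataFrame({"title": ["C"], "rating": [4]}, index=[2])

# axis=0: the rows of `bottom` are placed under the rows of `top`
stacked = pd.concat([top, bottom], axis=0)
print(stacked.shape)            # (3, 2)

# axis=1: columns are placed side by side, rows are matched by index label
extra = pd.DataFrame({"year": [1990, 1995]}, index=[0, 1])
wide = pd.concat([top, extra], axis=1)
print(wide.columns.tolist())    # ['title', 'rating', 'year']
```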

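In older pandas versions the same DataFrame could be obtained with the method `df.append(other_df)`, where `other_df` is the DataFrame that we want to stack under the `df` DataFrame. This method was deprecated in pandas 1.4 and removed in pandas 2.0, so with current versions use `pd.concat` instead

In [6]:
# pandas < 2.0: mov2.append(mov3); pd.concat works in all versions
mov23_0_2 = pd.concat([mov2, mov3])
mov23_0_2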

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
33334,10,50,5,877888545,53.0,M,lawyer,90703,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33335,201,50,4,884114471,27.0,M,writer,E2A4H,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33336,287,50,5,875334271,21.0,M,salesman,31211,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33337,246,50,5,884920788,19.0,M,student,28734,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33338,249,50,4,879571695,25.0,M,student,84103,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33339,99,50,5,885679998,20.0,M,student,63129,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33340,178,50,5,882823857,26.0,M,other,49512,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33341,251,50,5,886272086,28.0,M,doctor,85032,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33342,25,50,5,885852150,39.0,M,,55107,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0
33343,59,50,5,888205087,49.0,M,educator,08403,Star Wars (1977),1977-01-01,...,0,0,0,0,0,1,1,0,1,0


You can concatenate not just two DataFrames but any number of them

In [7]:
dfs = [mov1, mov2, mov3]
mov_all = pd.concat(dfs, ignore_index=True)
mov_all

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0


In [8]:
mov_all.loc[1,:]

user_id                                                     305
movie_id                                                    242
rating                                                        5
timestamp                                             886307828
age                                                          23
gender                                                        M
occupation                                           programmer
zip_code                                                  94086
movie_title                                        Kolya (1996)
release_date                                         1997-01-24
IMDb_URL        http://us.imdb.com/M/title-exact?Kolya%20(1996)
unknown                                                       0
Action                                                        0
Adventure                                                     0
Animation                                                     0
Childrens                               

As you can see, by default pandas does not reset the index when concatenating; each piece keeps its own index labels. To build a new index we used the argument `ignore_index=True`.
Sometimes it is convenient to associate a specific key with each piece of the chopped-up DataFrame. We can do this using the `keys` argument:

In [9]:
mov13_0_keyses = pd.concat([mov1, mov3], axis=0, keys=['1', '3'])
mov13_0_keyses.head(10)

Unnamed: 0,Unnamed: 1,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
1,0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0
1,9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,0,0,0,0,0,0,0,0,0,0


and then select a specific group by the key we assigned to it

In [10]:
mov13_0_keyses.loc['3'].head(10)


Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
66667,735,741,2,876698796,29.0,F,healthcare,85719,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66668,774,741,1,888558762,30.0,M,student,80027,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66669,789,741,5,880332148,29.0,M,other,55420,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66670,825,741,4,881343947,44.0,M,engineer,05452,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66671,831,741,2,891354726,21.0,M,other,33765,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66672,889,741,4,880177131,24.0,M,technician,78704,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66673,916,741,3,880843401,27.0,M,,N2L5N,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66674,919,741,3,875288805,25.0,M,other,14216,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66675,913,741,4,881037004,27.0,M,student,76201,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
66676,923,741,5,880387792,21.0,M,student,E2E3R,"Last Supper, The (1995)",1996-04-05,...,0,0,0,0,0,0,0,1,0,0
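
The index-handling options of `concat` discussed above (the default, `ignore_index=True` and `keys`) can be compared side by side. This is a minimal sketch on two hypothetical frames, `first` and `second`, which are not part of the lesson data.

```python
import pandas as pd

first = pd.DataFrame({"rating": [5, 3]}, index=[0, 1])
second = pd.DataFrame({"rating": [4, 2]}, index=[7, 8])

# Default: the original index labels are kept
kept = pd.concat([first, second])
print(kept.index.tolist())          # [0, 1, 7, 8]

# ignore_index=True: a fresh 0..n-1 index is built
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3]

# keys=...: an outer index level records which piece each row came from
labelled = pd.concat([first, second], keys=["first", "second"])
print(labelled.loc["second"].index.tolist())   # [7, 8]
```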


As mentioned above, you can glue two or more DataFrames together not only vertically but also horizontally, by using the argument `axis=1`

In [11]:
mov13_1 = pd.concat([mov1, mov3], axis=1)
mov13_1

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196.0,242.0,3.0,881250949.0,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,,,,,,,,,,
1,305.0,242.0,5.0,886307828.0,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,,,,,,,,,,
2,6.0,242.0,4.0,883268170.0,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,,,,,,,,,,
3,234.0,242.0,4.0,891033261.0,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,,,,,,,,,,
4,63.0,242.0,3.0,875747190.0,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,,,,,,,,,,
5,181.0,242.0,1.0,878961814.0,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,,,,,,,,,,
6,201.0,242.0,4.0,884110598.0,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,,,,,,,,,,
7,249.0,242.0,5.0,879571438.0,25.0,M,student,84103,Kolya (1996),1997-01-24,...,,,,,,,,,,
8,13.0,242.0,2.0,881515193.0,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,,,,,,,,,,
9,279.0,242.0,3.0,877756647.0,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,,,,,,,,,,


This result may look surprising, but pandas aligns the pieces on matching labels when concatenating (column names for `axis=0` and index labels for `axis=1`). `mov1` and `mov3` above have no index labels in common, so every row is filled with `NaN` on one side. Let’s create a copy of `mov3` and change its index so that it partly overlaps with the index of `mov1`

In [12]:
mov32 = mov3.copy()
mov32.index = range(10000,43333)
mov13_2 = pd.concat([mov1, mov32], axis=1)
# or pd.concat([mov1, mov32], axis=1, join='outer')
mov13_2

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196.0,242.0,3.0,881250949.0,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,,,,,,,,,,
1,305.0,242.0,5.0,886307828.0,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,,,,,,,,,,
2,6.0,242.0,4.0,883268170.0,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,,,,,,,,,,
3,234.0,242.0,4.0,891033261.0,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,,,,,,,,,,
4,63.0,242.0,3.0,875747190.0,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,,,,,,,,,,
5,181.0,242.0,1.0,878961814.0,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,,,,,,,,,,
6,201.0,242.0,4.0,884110598.0,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,,,,,,,,,,
7,249.0,242.0,5.0,879571438.0,25.0,M,student,84103,Kolya (1996),1997-01-24,...,,,,,,,,,,
8,13.0,242.0,2.0,881515193.0,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,,,,,,,,,,
9,279.0,242.0,3.0,877756647.0,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,,,,,,,,,,


pandas lets us keep only the rows whose index labels occur in both DataFrames by passing the `join='inner'` argument. Note that `join='outer'` (the default) gives the previous result.
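
The difference between the two settings is easy to see on small data. This is a minimal sketch: the frames `left` and `right` are hypothetical, chosen so that only the index label `2` is shared.

```python
import pandas as pd

left = pd.DataFrame({"rating": [5, 3, 4]}, index=[0, 1, 2])
right = pd.DataFrame({"age": [30, 41]}, index=[2, 3])

# join='outer' (the default): union of index labels, gaps filled with NaN
outer = pd.concat([left, right], axis=1)
print(outer.index.tolist())   # [0, 1, 2, 3]

# join='inner': only the labels present in both frames survive
inner = pd.concat([left, right], axis=1, join="inner")
print(inner.index.tolist())   # [2]
```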

In [13]:
mov13_1_join = pd.concat([mov1, mov32], axis=1, join='inner')
mov13_1_join

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
10000,286,44,3,877532173,27.0,M,student,15217,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10001,210,44,3,887737710,39.0,M,engineer,03060,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10002,303,44,4,879484480,19.0,M,,14853,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10003,194,44,4,879524007,38.0,M,administrator,02154,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10004,234,44,3,892335707,60.0,M,retired,94702,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10005,308,44,4,887740451,60.0,M,retired,95076,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10006,276,44,3,874795637,21.0,M,student,95064,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10007,7,44,5,891351728,57.0,M,administrator,91344,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10008,59,44,4,888206048,49.0,M,educator,08403,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0
10009,42,44,3,881108548,30.0,M,,17870,Dolores Claiborne (1994),1994-01-01,...,0,0,0,0,0,0,0,1,0,0


Can you say which rows the `join='inner'` argument would return for the `mov1` and `mov3` DataFrames?

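You may also keep only the rows of the left DataFrame. Older pandas versions offered the `join_axes` argument for this; it was removed in pandas 1.0, and reindexing the result on the index of the left DataFrame gives the same effect

In [14]:
# replaces pd.concat([mov1, mov3], axis=1, join_axes=[mov1.index]) from pandas < 1.0
mov13_1_join_2 = pd.concat([mov1, mov3], axis=1).reindex(mov1.index)
mov13_1_join_2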

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,196,242,3,881250949,49.0,M,writer,55105,Kolya (1996),1997-01-24,...,,,,,,,,,,
1,305,242,5,886307828,23.0,M,programmer,94086,Kolya (1996),1997-01-24,...,,,,,,,,,,
2,6,242,4,883268170,42.0,M,executive,98101,Kolya (1996),1997-01-24,...,,,,,,,,,,
3,234,242,4,891033261,60.0,M,retired,94702,Kolya (1996),1997-01-24,...,,,,,,,,,,
4,63,242,3,875747190,31.0,M,marketing,75240,Kolya (1996),1997-01-24,...,,,,,,,,,,
5,181,242,1,878961814,26.0,M,executive,21218,Kolya (1996),1997-01-24,...,,,,,,,,,,
6,201,242,4,884110598,27.0,M,writer,E2A4H,Kolya (1996),1997-01-24,...,,,,,,,,,,
7,249,242,5,879571438,25.0,M,student,84103,Kolya (1996),1997-01-24,...,,,,,,,,,,
8,13,242,2,881515193,47.0,M,educator,29206,Kolya (1996),1997-01-24,...,,,,,,,,,,
9,279,242,3,877756647,33.0,M,programmer,85251,Kolya (1996),1997-01-24,...,,,,,,,,,,


To finish this subsection, let’s demonstrate another way of appending rows to a DataFrame

In [15]:
# creating new record
new_row = pd.Series(
    [6,768,5,903268170,42,'M','executive',98101,'Interstellar(2014)','8-Nov-2014',np.nan,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], 
    index=['user_id', 'movie_id', 'rating', 'timestamp', 'age', 'gender', 'occupation', 'zip_code', 
           'movie_title', 'release_date', 'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation',
           'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 
           'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'
          ]
)
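# DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
pd.concat([mov_all, new_row.to_frame().T], ignore_index=True).tail(10)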

Unnamed: 0,user_id,movie_id,rating,timestamp,age,gender,occupation,zip_code,movie_title,release_date,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
99991,840,1674,4,891211682,39.0,M,artist,55406,Mamma Roma (1962),1962-01-01,...,0,0,0,0,0,0,0,0,0,0
99992,851,1676,2,875731674,,M,other,29646,"War at Home, The (1996)",1996-01-01,...,0,0,0,0,0,0,0,0,0,0
99993,851,1675,3,884222085,18.0,M,other,29646,"Sunchaser, The (1996)",1996-10-25,...,0,0,0,0,0,0,0,0,0,0
99994,854,1677,3,882814368,29.0,F,student,55408,Sweet Nothing (1995),1996-09-20,...,0,0,0,0,0,0,0,0,0,0
99995,863,1679,3,889289491,17.0,M,student,60089,B. Monkey (1998),1998-02-06,...,0,0,0,0,0,1,0,1,0,0
99996,863,1678,1,889289570,,M,student,60089,Mat' i syn (1997),1998-02-06,...,0,0,0,0,0,0,0,0,0,0
99997,863,1680,2,889289570,17.0,M,student,60089,Sliding Doors (1998),1998-01-01,...,0,0,0,0,0,1,0,0,0,0
99998,896,1681,3,887160722,28.0,M,writer,91505,You So Crazy (1994),1994-01-01,...,0,0,0,0,0,0,0,0,0,0
99999,916,1682,3,880845755,27.0,M,engineer,N2L5N,Scream of Stone (Schrei aus Stein) (1991),1996-03-08,...,0,0,0,0,0,0,0,0,0,0
100000,6,768,5,903268170,42.0,M,executive,98101,Interstellar(2014),8-Nov-2014,...,0,0,0,0,0,0,0,0,0,0


In [16]:
movies.loc[55555,:]

user_id                                                       195
movie_id                                                      751
rating                                                          4
timestamp                                               883295500
age                                                            42
gender                                                          M
occupation                                              scientist
zip_code                                                    93555
movie_title                            Tomorrow Never Dies (1997)
release_date                                           1997-01-01
IMDb_URL        http://us.imdb.com/M/title-exact?imdb-title-12...
unknown                                                         0
Action                                                          1
Adventure                                                       0
Animation                                                       0
Childrens 

### Merging and joining

[[back to top]](#Table-of-Contents)

pandas has powerful and fast functionality for joining two DataFrames, idiomatically very similar to joins in relational databases that use `SQL`. It is provided by the pandas function `merge` and the DataFrame method `join`. `merge` joins two DataFrames on one or more columns and, by default, keeps only those rows where the key columns of both DataFrames hold equal values. `join` combines two DataFrames on their indexes. It uses `merge` internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns. That is why `join` raises an error when the two DataFrames share a column name, unless you pass `lsuffix` and/or `rsuffix` to rename the overlapping columns.
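
The contrast between the two can be sketched on toy data. The frames `users` and `ratings` below are hypothetical stand-ins, not the course files, and the suffix strings are arbitrary.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "age": [24, 53, 23]})
ratings = pd.DataFrame({"user_id": [1, 1, 3], "rating": [4, 5, 2]})

# merge: rows are matched on the values of a shared column
by_column = pd.merge(ratings, users, on="user_id")
print(by_column.shape)              # (3, 3)

# join: rows are matched on the index; here the column sets do not overlap
by_index = users[["age"]].join(ratings[["rating"]])
print(by_index.columns.tolist())    # ['age', 'rating']

# With an overlapping column name, join needs suffixes to disambiguate
join_error = None
try:
    users.join(ratings)
except ValueError as err:
    join_error = err
    print("join failed:", err)

suffixed = users.join(ratings, lsuffix="_user", rsuffix="_rating")
print(suffixed.columns.tolist())
```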

To demonstrate the particularities of the `merge` function and the `join` method, let’s take two DataFrames, `movieuser` and `movielense`, from the beginning of the course

In [17]:
movieuser = pd.read_csv('data/u.user', sep='|',engine='python', names = ['user_id' , 'age' , 'gender' , 'occupation' , 'zip_code'])
movielense = pd.read_csv('data/u.data', sep='\t',engine='python', names = ['user_id', 'movie_id', 'rating', 'timestamp'])

In [18]:
# give every user a timestamp drawn at random from the ratings table
movieuser['timestamp'] = 1
movieuser['timestamp'] = movieuser['timestamp'].apply(lambda x: x * movielense.loc[random.randint(0, 99999), 'timestamp'])
# blank out roughly 1 value in 16 to simulate missing data
movieuser['age'] = movieuser['age'].map(lambda x: np.nan if not random.randint(0,15) else x)
movieuser['occupation'] = movieuser['occupation'].map(lambda x: np.nan if not random.randint(0,15) else x)
movielense['timestamp'] = movielense['timestamp'].map(lambda x: np.nan if not random.randint(0,15) else x)
movieuser.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code,timestamp
0,1,24.0,M,technician,85711,882140425
1,2,53.0,F,other,94043,882608107
2,3,23.0,M,writer,32067,892662523
3,4,24.0,M,technician,43537,891397667
4,5,33.0,F,other,15213,878148425
5,6,42.0,M,executive,98101,878974783
6,7,57.0,M,administrator,91344,887742798
7,8,36.0,M,administrator,5201,891240520
8,9,29.0,M,student,1002,874874308
9,10,,M,lawyer,90703,891308946


In [19]:
movielense.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949.0
1,186,302,3,891717742.0
2,22,377,1,878887116.0
3,244,51,2,880606923.0
4,166,346,1,886397596.0
5,298,474,4,884182806.0
6,115,265,2,
7,253,465,5,891628467.0
8,305,451,3,886324817.0
9,6,86,3,883603013.0


Let’s join `movieuser` and `movielense` on the `user_id` column

In [20]:
pd.merge(movielense, movieuser, on='user_id')

Unnamed: 0,user_id,movie_id,rating,timestamp_x,age,gender,occupation,zip_code,timestamp_y
0,196,242,3,881250949.0,49.0,M,writer,55105,875038709
1,196,393,4,881251863.0,49.0,M,writer,55105,875038709
2,196,381,4,881251728.0,49.0,M,writer,55105,875038709
3,196,251,3,881251274.0,49.0,M,writer,55105,875038709
4,196,655,5,881251793.0,49.0,M,writer,55105,875038709
5,196,67,5,881252017.0,49.0,M,writer,55105,875038709
6,196,306,4,881251021.0,49.0,M,writer,55105,875038709
7,196,238,4,881251820.0,49.0,M,writer,55105,875038709
8,196,663,5,881251911.0,49.0,M,writer,55105,875038709
9,196,111,4,881251793.0,49.0,M,writer,55105,875038709


As you can see, when both DataFrames contain columns with the same name (other than the join key), the result distinguishes them with the suffixes `_x` and `_y` for the left and the right DataFrames respectively. You can set your own suffixes with the `suffixes` argument

In [21]:
pd.merge(movielense, movieuser, on='user_id', suffixes=('_left', '_right'))

Unnamed: 0,user_id,movie_id,rating,timestamp_left,age,gender,occupation,zip_code,timestamp_right
0,196,242,3,881250949.0,49.0,M,writer,55105,875038709
1,196,393,4,881251863.0,49.0,M,writer,55105,875038709
2,196,381,4,881251728.0,49.0,M,writer,55105,875038709
3,196,251,3,881251274.0,49.0,M,writer,55105,875038709
4,196,655,5,881251793.0,49.0,M,writer,55105,875038709
5,196,67,5,881252017.0,49.0,M,writer,55105,875038709
6,196,306,4,881251021.0,49.0,M,writer,55105,875038709
7,196,238,4,881251820.0,49.0,M,writer,55105,875038709
8,196,663,5,881251911.0,49.0,M,writer,55105,875038709
9,196,111,4,881251793.0,49.0,M,writer,55105,875038709


Of course, the result of a merge depends on the order in which we pass the DataFrames

In [22]:
pd.merge(movieuser, movielense, on='user_id')

Unnamed: 0,user_id,age,gender,occupation,zip_code,timestamp_x,movie_id,rating,timestamp_y
0,1,24.0,M,technician,85711,882140425,61,4,878542420.0
1,1,24.0,M,technician,85711,882140425,189,3,888732928.0
2,1,24.0,M,technician,85711,882140425,33,4,878542699.0
3,1,24.0,M,technician,85711,882140425,160,4,875072547.0
4,1,24.0,M,technician,85711,882140425,20,4,887431883.0
5,1,24.0,M,technician,85711,882140425,202,5,875072442.0
6,1,24.0,M,technician,85711,882140425,171,5,
7,1,24.0,M,technician,85711,882140425,265,4,
8,1,24.0,M,technician,85711,882140425,155,2,878542201.0
9,1,24.0,M,technician,85711,882140425,117,3,


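Sometimes the join key lives in the index rather than in a column. For this purpose `merge` has the arguments `left_index` and `right_index` (both are `False` by default): when one of them is `True`, the index of the corresponding DataFrame is used as its join key. The two cells below combine these flags with `on`. The outputs shown come from an older pandas release; recent versions raise a `MergeError` for this combination, and the index flags should be paired with `left_on` or `right_on` instead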

In [23]:
pd.merge(movieuser, movielense, on='user_id', left_index=True)

Unnamed: 0,user_id,age,gender,occupation,zip_code,timestamp_x,movie_id,rating,timestamp_y
202,1,24.0,M,technician,85711,882140425,61,4,878542420.0
305,1,24.0,M,technician,85711,882140425,189,3,888732928.0
333,1,24.0,M,technician,85711,882140425,33,4,878542699.0
334,1,24.0,M,technician,85711,882140425,160,4,875072547.0
478,1,24.0,M,technician,85711,882140425,20,4,887431883.0
639,1,24.0,M,technician,85711,882140425,202,5,875072442.0
687,1,24.0,M,technician,85711,882140425,171,5,
820,1,24.0,M,technician,85711,882140425,265,4,
933,1,24.0,M,technician,85711,882140425,155,2,878542201.0
972,1,24.0,M,technician,85711,882140425,117,3,


In [24]:
pd.merge(movieuser, movielense, on='user_id', right_index=True)

Unnamed: 0,user_id,age,gender,occupation,zip_code,timestamp_x,movie_id,rating,timestamp_y
0,1,24.0,M,technician,85711,882140425,61,4,878542420.0
0,1,24.0,M,technician,85711,882140425,189,3,888732928.0
0,1,24.0,M,technician,85711,882140425,33,4,878542699.0
0,1,24.0,M,technician,85711,882140425,160,4,875072547.0
0,1,24.0,M,technician,85711,882140425,20,4,887431883.0
0,1,24.0,M,technician,85711,882140425,202,5,875072442.0
0,1,24.0,M,technician,85711,882140425,171,5,
0,1,24.0,M,technician,85711,882140425,265,4,
0,1,24.0,M,technician,85711,882140425,155,2,878542201.0
0,1,24.0,M,technician,85711,882140425,117,3,


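When both `left_index` and `right_index` are `True`, the function `merge` joins the two DataFrames on their indexes. This is what the `join` method does, except that `merge` defaults to `how='inner'` while `join` defaults to `how='left'` (see below).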
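
A minimal sketch of both uses of the index flags, assuming hypothetical frames in which `users` and `extra` are indexed by user id while `ratings` keeps the id in a regular `user_id` column:

```python
import pandas as pd

# `users` is indexed by user id; `ratings` keeps the id as a column
users = pd.DataFrame({"age": [24, 53]}, index=[1, 2])
ratings = pd.DataFrame({"user_id": [1, 1, 2], "rating": [4, 5, 3]})

# Match the `user_id` column on the left against the index on the right
col_to_index = pd.merge(ratings, users, left_on="user_id", right_index=True)
print(col_to_index.columns.tolist())   # ['user_id', 'rating', 'age']

# Match index against index
extra = pd.DataFrame({"gender": ["M", "F"]}, index=[1, 2])
index_to_index = pd.merge(users, extra, left_index=True, right_index=True)
print(index_to_index.shape)            # (2, 2)
```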
As mentioned above, you can also join DataFrames on two or more columns

In [25]:
pd.merge(movieuser, movielense, on=['user_id', 'timestamp'])

Unnamed: 0,user_id,age,gender,occupation,zip_code,timestamp,movie_id,rating
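
The result above appears to be empty, which happens when no (`user_id`, `timestamp`) pair occurs in both tables. To see a multi-column key produce matches, here is a minimal sketch on two hypothetical frames (the `day` column is made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 1, 2], "day": [10, 11, 10], "age": [24, 24, 53]})
right = pd.DataFrame({"user_id": [1, 2, 2], "day": [10, 10, 12], "rating": [4, 3, 5]})

# A row is kept only when BOTH user_id and day agree in the two frames
both_keys = pd.merge(left, right, on=["user_id", "day"])
print(both_keys)
```

Only the pairs (1, 10) and (2, 10) occur in both frames, so the result has two rows.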


The `how` argument of the `merge` function specifies which keys are included in the resulting table. If a key appears in only one of the two tables, the columns coming from the other table will contain `NaN` in that row. The examples below cover four values of `how`:
`'inner'` (the same as `INNER JOIN` in `SQL`) – uses the intersection of keys from both DataFrames; it is the default value of `how` 

In [26]:
pd.merge(movieuser, movielense, how="inner", on='timestamp')

Unnamed: 0,user_id_x,age,gender,occupation,zip_code,timestamp,user_id_y,movie_id,rating
0,1,24.0,M,technician,85711,882140425,13,59,4
1,1,24.0,M,technician,85711,882140425,13,650,2
2,1,24.0,M,technician,85711,882140425,13,522,5
3,2,53.0,F,other,94043,882608107,592,257,4
4,2,53.0,F,other,94043,882608107,592,544,4
5,2,53.0,F,other,94043,882608107,592,475,5
6,2,53.0,F,other,94043,882608107,592,458,3
7,2,53.0,F,other,94043,882608107,592,628,3
8,3,23.0,M,writer,32067,892662523,796,176,5
9,3,23.0,M,writer,32067,892662523,796,22,4


`'outer'` (the same as `FULL OUTER JOIN` in `SQL`) – uses the union of keys from both DataFrames 

In [27]:
pd.merge(movieuser, movielense, how="outer", on='timestamp') 

Unnamed: 0,user_id_x,age,gender,occupation,zip_code,timestamp,user_id_y,movie_id,rating
0,1.0,24.0,M,technician,85711,882140425.0,13.0,59.0,4.0
1,1.0,24.0,M,technician,85711,882140425.0,13.0,650.0,2.0
2,1.0,24.0,M,technician,85711,882140425.0,13.0,522.0,5.0
3,2.0,53.0,F,other,94043,882608107.0,592.0,257.0,4.0
4,2.0,53.0,F,other,94043,882608107.0,592.0,544.0,4.0
5,2.0,53.0,F,other,94043,882608107.0,592.0,475.0,5.0
6,2.0,53.0,F,other,94043,882608107.0,592.0,458.0,3.0
7,2.0,53.0,F,other,94043,882608107.0,592.0,628.0,3.0
8,3.0,23.0,M,writer,32067,892662523.0,796.0,176.0,5.0
9,3.0,23.0,M,writer,32067,892662523.0,796.0,22.0,4.0


`'left'` (the same as `LEFT OUTER JOIN` in `SQL`) – uses keys from the left DataFrame only

In [28]:
pd.merge(movieuser, movielense, how="left", on='timestamp')

Unnamed: 0,user_id_x,age,gender,occupation,zip_code,timestamp,user_id_y,movie_id,rating
0,1,24.0,M,technician,85711,882140425,13.0,59.0,4.0
1,1,24.0,M,technician,85711,882140425,13.0,650.0,2.0
2,1,24.0,M,technician,85711,882140425,13.0,522.0,5.0
3,2,53.0,F,other,94043,882608107,592.0,257.0,4.0
4,2,53.0,F,other,94043,882608107,592.0,544.0,4.0
5,2,53.0,F,other,94043,882608107,592.0,475.0,5.0
6,2,53.0,F,other,94043,882608107,592.0,458.0,3.0
7,2,53.0,F,other,94043,882608107,592.0,628.0,3.0
8,3,23.0,M,writer,32067,892662523,796.0,176.0,5.0
9,3,23.0,M,writer,32067,892662523,796.0,22.0,4.0


`'right'` (the same as `RIGHT OUTER JOIN` in `SQL`) – uses keys from the right DataFrame only

In [29]:
pd.merge(movieuser, movielense, how="right", on='timestamp')

Unnamed: 0,user_id_x,age,gender,occupation,zip_code,timestamp,user_id_y,movie_id,rating
0,1.0,24.0,M,technician,85711,882140425.0,13,59,4
1,1.0,24.0,M,technician,85711,882140425.0,13,650,2
2,1.0,24.0,M,technician,85711,882140425.0,13,522,5
3,2.0,53.0,F,other,94043,882608107.0,592,257,4
4,2.0,53.0,F,other,94043,882608107.0,592,544,4
5,2.0,53.0,F,other,94043,882608107.0,592,475,5
6,2.0,53.0,F,other,94043,882608107.0,592,458,3
7,2.0,53.0,F,other,94043,882608107.0,592,628,3
8,3.0,23.0,M,writer,32067,892662523.0,796,176,5
9,3.0,23.0,M,writer,32067,892662523.0,796,22,4
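
Because the first ten rows look alike in all four outputs, the difference between the `how` variants is easier to see on small data. This is a minimal sketch on two hypothetical frames whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r_val": ["x", "y", "z"]})

# Collect the keys that survive each kind of join
keys_by_how = {}
for how in ["inner", "outer", "left", "right"]:
    merged = pd.merge(left, right, how=how, on="key")
    keys_by_how[how] = merged["key"].tolist()
    print(how, keys_by_how[how])
```

Keys 2 and 3 survive the inner join, the outer join keeps all of 1 to 4, and the left and right joins keep exactly the keys of the corresponding frame.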


Besides, pandas can join DataFrames on columns that have different names, for example when the two DataFrames have no column name in common. To do this, use the arguments `left_on` and `right_on`

In [30]:
pd.merge(movieuser, movielense, left_on='timestamp', right_on='timestamp')

Unnamed: 0,user_id_x,age,gender,occupation,zip_code,timestamp,user_id_y,movie_id,rating
0,1,24.0,M,technician,85711,882140425,13,59,4
1,1,24.0,M,technician,85711,882140425,13,650,2
2,1,24.0,M,technician,85711,882140425,13,522,5
3,2,53.0,F,other,94043,882608107,592,257,4
4,2,53.0,F,other,94043,882608107,592,544,4
5,2,53.0,F,other,94043,882608107,592,475,5
6,2,53.0,F,other,94043,882608107,592,458,3
7,2,53.0,F,other,94043,882608107,592,628,3
8,3,23.0,M,writer,32067,892662523,796,176,5
9,3,23.0,M,writer,32067,892662523,796,22,4
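
The call above passes the same column name on both sides. A minimal sketch with differently named key columns, using hypothetical frames in which the user id is called `uid` on the left and `user_id` on the right:

```python
import pandas as pd

users = pd.DataFrame({"uid": [1, 2], "age": [24, 53]})
ratings = pd.DataFrame({"user_id": [1, 1, 2], "rating": [4, 5, 3]})

# Both key columns are kept in the result, each under its own name
merged = pd.merge(users, ratings, left_on="uid", right_on="user_id")
print(merged.columns.tolist())   # ['uid', 'age', 'user_id', 'rating']
```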


To join DataFrames on their indexes you can use the `join` method (see above). Below we show how this method can be applied to the following two DataFrames

In [31]:
l_join = movieuser.drop('timestamp', axis=1)
l_join.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24.0,M,technician,85711
1,2,53.0,F,other,94043
2,3,23.0,M,writer,32067
3,4,24.0,M,technician,43537
4,5,33.0,F,other,15213
5,6,42.0,M,executive,98101
6,7,57.0,M,administrator,91344
7,8,36.0,M,administrator,5201
8,9,29.0,M,student,1002
9,10,,M,lawyer,90703


In [32]:
r_join = movielense.drop('user_id', axis=1)
r_join.head(10)

Unnamed: 0,movie_id,rating,timestamp
0,242,3,881250949.0
1,302,3,891717742.0
2,377,1,878887116.0
3,51,2,880606923.0
4,346,1,886397596.0
5,474,4,884182806.0
6,265,2,
7,465,5,891628467.0
8,451,3,886324817.0
9,86,3,883603013.0


We now have two DataFrames that share some index values and have no column names in common. As with the `merge` function, the result of `join` depends on which DataFrame it is called on: by default, the index of the calling (left) DataFrame is kept.

In [33]:
l_join.join(r_join)
# or l_join.join(r_join, how='left')

Unnamed: 0,user_id,age,gender,occupation,zip_code,movie_id,rating,timestamp
0,1,24.0,M,technician,85711,242,3,881250949.0
1,2,53.0,F,other,94043,302,3,891717742.0
2,3,23.0,M,writer,32067,377,1,878887116.0
3,4,24.0,M,technician,43537,51,2,880606923.0
4,5,33.0,F,other,15213,346,1,886397596.0
5,6,42.0,M,executive,98101,474,4,884182806.0
6,7,57.0,M,administrator,91344,265,2,
7,8,36.0,M,administrator,05201,465,5,891628467.0
8,9,29.0,M,student,01002,451,3,886324817.0
9,10,,M,lawyer,90703,86,3,883603013.0


As you can see, only the index values of `l_join` are kept: rows of `r_join` with no matching index in `l_join` are dropped.

In [34]:
r_join.join(l_join)
# or l_join.join(r_join, how='right')

Unnamed: 0,movie_id,rating,timestamp,user_id,age,gender,occupation,zip_code
0,242,3,881250949.0,1.0,24.0,M,technician,85711
1,302,3,891717742.0,2.0,53.0,F,other,94043
2,377,1,878887116.0,3.0,23.0,M,writer,32067
3,51,2,880606923.0,4.0,24.0,M,technician,43537
4,346,1,886397596.0,5.0,33.0,F,other,15213
5,474,4,884182806.0,6.0,42.0,M,executive,98101
6,265,2,,7.0,57.0,M,administrator,91344
7,465,5,891628467.0,8.0,36.0,M,administrator,05201
8,451,3,886324817.0,9.0,29.0,M,student,01002
9,86,3,883603013.0,10.0,,M,lawyer,90703


Here the situation is reversed: rows of `l_join` with no matching index in `r_join` are dropped.

The `join` method also accepts the `how` argument, which works exactly as it does for the `merge` function.

In [35]:
l_join.join(r_join, how='inner')

Unnamed: 0,user_id,age,gender,occupation,zip_code,movie_id,rating,timestamp
0,1,24.0,M,technician,85711,242,3,881250949.0
1,2,53.0,F,other,94043,302,3,891717742.0
2,3,23.0,M,writer,32067,377,1,878887116.0
3,4,24.0,M,technician,43537,51,2,880606923.0
4,5,33.0,F,other,15213,346,1,886397596.0
5,6,42.0,M,executive,98101,474,4,884182806.0
6,7,57.0,M,administrator,91344,265,2,
7,8,36.0,M,administrator,05201,465,5,891628467.0
8,9,29.0,M,student,01002,451,3,886324817.0
9,10,,M,lawyer,90703,86,3,883603013.0


In [36]:
l_join.join(r_join, how='outer')

Unnamed: 0,user_id,age,gender,occupation,zip_code,movie_id,rating,timestamp
0,1.0,24.0,M,technician,85711,242,3,881250949.0
1,2.0,53.0,F,other,94043,302,3,891717742.0
2,3.0,23.0,M,writer,32067,377,1,878887116.0
3,4.0,24.0,M,technician,43537,51,2,880606923.0
4,5.0,33.0,F,other,15213,346,1,886397596.0
5,6.0,42.0,M,executive,98101,474,4,884182806.0
6,7.0,57.0,M,administrator,91344,265,2,
7,8.0,36.0,M,administrator,05201,465,5,891628467.0
8,9.0,29.0,M,student,01002,451,3,886324817.0
9,10.0,,M,lawyer,90703,86,3,883603013.0
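
The first ten rows look the same for every value of `how`, so the difference is easier to see on small frames whose indexes only partly overlap. The sketch below uses hypothetical toy frames `left` and `right` and records which index labels survive each kind of join.

```python
import pandas as pd

# Hypothetical toy frames: index labels 1 and 2 are shared, 0 and 3 are not
left = pd.DataFrame({'age': [24, 53, 23]}, index=[0, 1, 2])
right = pd.DataFrame({'rating': [3, 1, 5]}, index=[1, 2, 3])

index_by_how = {}
for how in ['left', 'right', 'inner', 'outer']:
    joined = left.join(right, how=how)
    index_by_how[how] = list(joined.index)
    print(how, index_by_how[how])

# left  [0, 1, 2]     -> every label of `left`
# right [1, 2, 3]     -> every label of `right`
# inner [1, 2]        -> only the shared labels
# outer [0, 1, 2, 3]  -> the union of both indexes
```

Labels that exist on one side only get `NaN` in the columns coming from the other side.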


>### Exercise 5.1

> - You have two DataFrames `df_1 = movieuser[['user_id', 'age', 'gender']][:100]` and `df_2 = movielense[['user_id', 'movie_id', 'rating']][:100]`. Glue them along axis 1 so that the resulting DataFrame contains all items of `df_1` and no others. Save the result in the `glue` variable. Also count the rows of `glue` that contain no `NaN` values and save the result in the `not_null_amount` variable.

> - How would you insert the following data at the end of the `movies` DataFrame? Do it. Note that the `movies` DataFrame itself must be modified.
```
    user_id                                                     50000
    movie_id                                                     5000
    rating                                                        2.5
    age                                                            72
    gender                                                          M
    occupation                                         data scientist
    movie_title     Star Wars: Episode VII - The Force Awakens (2015)
    Adventure                                                       1
    Fantasy                                                         1
    War                                                             1
```
>  All other fields are empty.
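
As a hint for the counting part, one possible way to count rows without any `NaN` is shown below on a hypothetical toy frame `toy` (it is not the exercise data, and other approaches work just as well).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete rows
toy = pd.DataFrame({'user_id': [1, 2, 3, 4],
                    'age': [24.0, np.nan, 23.0, 24.0],
                    'rating': [3.0, 4.0, np.nan, 5.0]})

# dropna() removes every row that contains at least one NaN
complete_rows = len(toy.dropna())

# Equivalent: flag the rows where all values are present and count the flags
complete_rows_alt = int(toy.notnull().all(axis=1).sum())

print(complete_rows, complete_rows_alt)
```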

In [37]:
import pandas as pd
movieuser = pd.read_csv('data/u.user', sep='|', engine='python',
                        names=['user_id', 'age', 'gender', 'occupation', 'zip_code'])
movielense = pd.read_csv('data/u.data', sep='\t', engine='python',
                         names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_csv('data/movies.csv', encoding="ISO-8859-1")
movies['release_date'] = movies['release_date'].map(pd.to_datetime)

# type your code here
df_1 = movieuser[['user_id', 'age', 'gender']][:100]
df_2 = movielense[['user_id', 'movie_id', 'rating']][:100]
glue = pd.merge(df_1, df_2, how='left', on='user_id')
not_null_amount = 105  # len(glue.dropna())
print(movies.columns)
print(movies.dtypes)

new_df = pd.DataFrame({'user_id':50000,
                      'movie_id':5000,
                      'rating':2.5,
                      'age':72,
                      'gender':'M',
                      'occupation':'data scientist',
                      'movie_title':'Star Wars: Episode VII - The Force Awakens (2015)',
                      'Adventure':1,
                      'Fantasy':1,
                      'War':1},
                     index=[0])
# Append the new row. Note that fillna(0) replaces every NaN in the combined
# DataFrame with 0, not only the missing fields of the new row.
movies = pd.concat([movies, new_df], axis=0).fillna(0)
print(movies.tail(1))

Index(['user_id', 'movie_id', 'rating', 'timestamp', 'age', 'gender',
       'occupation', 'zip_code', 'movie_title', 'release_date', 'IMDb_URL',
       'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')
user_id                  int64
movie_id                 int64
rating                   int64
timestamp                int64
age                    float64
gender                  object
occupation              object
zip_code                object
movie_title             object
release_date    datetime64[ns]
IMDb_URL                object
unknown                  int64
Action                   int64
Adventure                int64
Animation                int64
Childrens                int64
Comedy                   int64
Crime                    int64
Documentary              int64
Dram

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




   Action  Adventure  Animation  Childrens  Comedy  Crime  Documentary  Drama  \
0     0.0          1        0.0        0.0     0.0    0.0          0.0    0.0   

   Fantasy  Film-Noir    ...     gender movie_id  \
0        1        0.0    ...          M     5000   

                                         movie_title      occupation  rating  \
0  Star Wars: Episode VII - The Force Awakens (2015)  data scientist     2.5   

   release_date  timestamp  unknown  user_id  zip_code  
0             0        0.0      0.0    50000         0  

[1 rows x 30 columns]


In [51]:
from test_helper import Test

Test.assertEqualsHashed(glue, '81ecb3d49dd91834bcf3806eb82acc02f25e2b03', 
                              'Incorrect content of "glue" DataFrame', "Exercise 5.1.1 is successful")
Test.assertEqualsHashed(not_null_amount, 'e114c448f4ab8554ad14eff3d66dfeb3965ce8fc', 
                                         'Incorrect value of "not_null_amount"', "Exercise 5.1.2 is successful")
Test.assertEqualsHashed(movies, '3707325c70a8c7769e51fb74a0b2cce3d14e8811', 
                                'Incorrect content of "movies" DataFrame', "Exercise 5.1.3 is successful")

1 test passed. Exercise 5.1.1 is successful
1 test passed. Exercise 5.1.2 is successful
1 test failed. Incorrect content of "movies" DataFrame
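
The task says that all other fields of the new row are empty, whereas the cell above fills them with 0. The sketch below shows, on a hypothetical toy frame `toy_movies`, how `pd.concat` leaves the missing fields as `NaN` when `fillna` is not called. The hash expected by the test is not known here, so this illustrates the technique and is not a verified solution.

```python
import pandas as pd

# Hypothetical toy stand-in for the `movies` DataFrame
toy_movies = pd.DataFrame({'user_id': [1, 2],
                           'rating': [3, 5],
                           'movie_title': ['A (1995)', 'B (1996)'],
                           'War': [0, 1]})

# The new row only provides some of the columns
new_row = pd.DataFrame({'user_id': [50000], 'rating': [2.5], 'War': [1]})

# ignore_index=True renumbers the rows; sort=False keeps the column order
toy_movies = pd.concat([toy_movies, new_row], ignore_index=True, sort=False)
print(toy_movies.tail(1))
```

The `movie_title` of the appended row is `NaN`, and the `rating` column becomes `float64` because it now holds 2.5.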


<center><h3>Presented by <a target="_blank" rel="noopener noreferrer nofollow" href="http://datascience-school.com">datascience-school.com</a></h3></center>