# Module 2: Introduction to Numpy and Pandas

The following tutorial contains **examples of using the numpy and pandas library modules**. Read the step-by-step instructions below carefully. To execute the code, click on the cell and press the `SHIFT-ENTER` keys simultaneously.

## 2.2 Introduction to Pandas

Pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a pandas data structure.

Pandas provides two convenient data structures for storing and manipulating data: `Series` and `DataFrame`.
- A `Series` is similar to a one-dimensional array.
- A `DataFrame` is a tabular representation akin to a spreadsheet table. 

A `Series` is essentially a single column, and a `DataFrame` is a two-dimensional table made up of a collection of Series that share the same row index.

![](series-dataframe2.png)

For more details, visit <https://pandas.pydata.org/pandas-docs/stable/index.html>.

To use the `pandas` package, just import it:

```python
import pandas as pd
```
By convention, almost all Python users import `pandas` as `pd`.

`numpy` and `pandas` are used together in many scenarios, so import both packages for convenience.

In [1]:
import numpy as np
import pandas as pd
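
The relationship between the two structures described above can be sketched with a small example. The temperature and precipitation values are illustrative, and `pd.concat` is just one of several ways to combine Series into a DataFrame.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data)
temp = pd.Series([45.1, 42.4, 47.2], index=[2011, 2012, 2013], name='temp')
precip = pd.Series([32.4, 34.5, 39.2], index=[2011, 2012, 2013], name='precip')

# Placing them side by side gives a DataFrame with one column per Series
weather = pd.concat([temp, precip], axis=1)
print(weather)

# Selecting a single column gives back a Series
print(type(weather['temp']))
```

Each Series name becomes a column name, and the shared index becomes the row index of the DataFrame.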

### 2.2.1 Series

- A Series object consists of a one-dimensional array of values whose elements can be referenced through an index (a sequence of labels, which can be supplied as a list or tuple).
- A Series object can be created from a list, a numpy array, or a Python dictionary.
- You can apply most numpy functions to a Series object.
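
The cells below build Series from a list and from a dictionary. As a complement, here is a minimal sketch of the numpy-array case; the variable names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 4.0, 9.0])
s_arr = pd.Series(arr, index=['a', 'b', 'c'])   # Series built from a numpy array

# A numpy function applied to a Series returns a Series and keeps the index
roots = np.sqrt(s_arr)
print(roots)
```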

In [2]:
s = pd.Series([3.1, 2.4, -1.7, 0.2, -2.9, 4.5])   # creating a series from a list
print('Series, s =\n', s, '\n')

print('s.values =', s.values)     # display values of the Series
print('s.index =', s.index)       # display indices of the Series
print('s.dtype =', s.dtype)       # display the element type of the Series
print('s[0] =', s[0])             # display the first value

Series, s =
 0    3.1
1    2.4
2   -1.7
3    0.2
4   -2.9
5    4.5
dtype: float64 

s.values = [ 3.1  2.4 -1.7  0.2 -2.9  4.5]
s.index = RangeIndex(start=0, stop=6, step=1)
s.dtype = float64
s[0] = 3.1


In [3]:
s2 = pd.Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
              index=['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6',]) # custom index
print('Series s2 =\n', s2, '\n')

print('s2.values =', s2.values)   # display values of the Series
print('s2.index =', s2.index)     # display indices of the Series
print('s2.dtype =', s2.dtype)     # display the element type of the Series
print('s2.iloc[0] =', s2.iloc[0])   # display the first value by position
print('s2["Jan 1"] =', s2["Jan 1"]) # display the same value by its label

Series s2 =
 Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64 

s2.values = [ 1.2  2.5 -2.2  3.1 -0.8 -3.2]
s2.index = Index(['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'], dtype='object')
s2.dtype = float64
s2.iloc[0] = 1.2
s2["Jan 1"] = 1.2


In [4]:
capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}

s3 = pd.Series(capitals)   # creating a series from dictionary object
print('Series s3 =\n', s3, '\n')

print('s3.values =', s3.values)   # display values of the Series
print('s3.index=', s3.index)      # display indices of the Series
print('s3.dtype =', s3.dtype)     # display the element type of the Series
print('s3.iloc[0] =', s3.iloc[0])    # display the first value by position
print('s3["MI"] =', s3["MI"])  # display the same value by its label

Series s3 =
 MI       Lansing
CA    Sacramento
TX        Austin
MN       St Paul
dtype: object 

s3.values = ['Lansing' 'Sacramento' 'Austin' 'St Paul']
s3.index= Index(['MI', 'CA', 'TX', 'MN'], dtype='object')
s3.dtype = object
s3.iloc[0] = Lansing
s3["MI"] = Lansing


In [5]:
# Accessing elements of a Series
print('s3.iloc[2]=', s3.iloc[2])   # display third element of the Series (by position)
print('s3[\'TX\']=', s3['TX'])   # display the same element by its index label

print('\ns3[1:3]=')             # display a slice of the Series
print(s3[1:3])
print('\ns3.iloc[1:3]=')        # the same slice, explicitly positional
print(s3.iloc[1:3])

s3.iloc[2]= Austin
s3['TX']= Austin

s3[1:3]=
CA    Sacramento
TX        Austin
dtype: object

s3.iloc[1:3]=
CA    Sacramento
TX        Austin
dtype: object


Several attributes and methods report the number of elements in a Series. Their results differ only when the Series contains null elements: `size` counts every element, whereas `count()` counts only the non-null ones.

In [6]:
print('Shape of s3 =', s3.shape)   # get the dimension of the Series
print('Size of s3 =', s3.size)     # get the number of elements of the Series
print('Count of s3 =', s3.count()) # get the number of non-null elements of the Series

Shape of s3 = (4,)
Size of s3 = 4
Count of s3 = 4


NaN (not-a-number) values (`np.nan`) can be inserted to mark missing entries.

In [7]:
s3['US'] = np.nan
print('Series s3 =\n', s3, '\n')

Series s3 =
 MI       Lansing
CA    Sacramento
TX        Austin
MN       St Paul
US           NaN
dtype: object 
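
Once a Series contains a NaN, the element counts discussed earlier no longer agree. The following minimal sketch uses a shortened, illustrative version of the capitals Series to show the difference.

```python
import numpy as np
import pandas as pd

capitals = pd.Series({'MI': 'Lansing', 'CA': 'Sacramento', 'US': np.nan})

n_all = capitals.size              # every element, including NaN
n_valid = capitals.count()         # only non-null elements
n_missing = capitals.isna().sum()  # number of missing elements

print(n_all, n_valid, n_missing)
```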



A boolean filter can be used to select elements of a Series.

In [8]:
print(s2[s2 > 0])   # applying filter to select positive elements of the Series
print()

print(s3[s3.isna()])   # applying filter to select not-a-number elements of the Series

Jan 1    1.2
Jan 2    2.5
Jan 4    3.1
dtype: float64

US    NaN
dtype: object
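
Filters can also be combined. A minimal sketch, reusing the values of `s2` with a default index: each condition needs its own parentheses, and the element-wise operators `&` (and), `|` (or), and `~` (not) replace Python's `and`, `or`, and `not`.

```python
import pandas as pd

s = pd.Series([1.2, 2.5, -2.2, 3.1, -0.8, -3.2])

between = s[(s > 0) & (s < 3)]    # positive values below 3
outside = s[(s < -1) | (s > 3)]   # values below -1 or above 3

print(between)
print(outside)
```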


Scalar operations can be performed on the elements of a numeric Series.

In [9]:
print('s2 + 4 =\n', s2 + 4, '\n')       
print('s2 / 4 =\n', s2 / 4)                 

s2 + 4 =
 Jan 1    5.2
Jan 2    6.5
Jan 3    1.8
Jan 4    7.1
Jan 5    3.2
Jan 6    0.8
dtype: float64 

s2 / 4 =
 Jan 1    0.300
Jan 2    0.625
Jan 3   -0.550
Jan 4    0.775
Jan 5   -0.200
Jan 6   -0.800
dtype: float64


In [10]:
# addition of two Series: values are matched by index label, not by position
mine = pd.Series([10,20,30], index=['naver','skt','kt'])
wife = pd.Series([10,30,20], index=['kt','naver','skt'])

family = mine + wife
print(family)

kt       40
naver    40
skt      40
dtype: int64
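
When the two Series do not share all of their labels, the unmatched labels produce NaN. The sketch below illustrates this with a hypothetical second Series `other`; the `fill_value` argument of `Series.add` is one way to treat a missing label as zero instead.

```python
import pandas as pd

mine = pd.Series([10, 20, 30], index=['naver', 'skt', 'kt'])
other = pd.Series([5, 15], index=['kt', 'lg'])   # hypothetical: shares only 'kt'

plain = mine + other                     # labels found in only one Series become NaN
filled = mine.add(other, fill_value=0)   # a missing label is treated as 0

print(plain)
print(filled)
```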


Numpy functions can be applied to pandas Series. 

In [11]:
print('np.log(s2 + 4) =\n', np.log(s2 + 4), '\n')    # applying log function to a numeric Series
print('np.exp(s2 - 4) =\n', np.exp(s2 - 4), '\n')    # applying exponential function to a numeric Series

np.log(s2 + 4) =
 Jan 1    1.648659
Jan 2    1.871802
Jan 3    0.587787
Jan 4    1.960095
Jan 5    1.163151
Jan 6   -0.223144
dtype: float64 

np.exp(s2 - 4) =
 Jan 1    0.060810
Jan 2    0.223130
Jan 3    0.002029
Jan 4    0.406570
Jan 5    0.008230
Jan 6    0.000747
dtype: float64 



The `value_counts()` method tabulates the count of each distinct value in a Series. Missing values (NaN) are excluded by default.

In [12]:
colors = pd.Series(['red', 'blue', 'blue', 'yellow', 'red', 'green', 'blue', np.nan])
print('colors =\n', colors, '\n')

print('colors.value_counts() =\n', colors.value_counts())

colors =
 0       red
1      blue
2      blue
3    yellow
4       red
5     green
6      blue
7       NaN
dtype: object 

colors.value_counts() =
 blue      3
red       2
yellow    1
green     1
dtype: int64
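
Two optional arguments of `value_counts()` are often useful here. The sketch below reuses the `colors` Series: `dropna=False` includes the missing value in the tally, and `normalize=True` returns proportions instead of counts.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'blue', 'blue', 'yellow', 'red', 'green', 'blue', np.nan])

with_nan = colors.value_counts(dropna=False)   # also counts the NaN entry
shares = colors.value_counts(normalize=True)   # proportions of the non-null values

print(with_nan)
print(shares)
```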


### 2.2.2 DataFrame

![](dataframe.png)

- A DataFrame object is a tabular, spreadsheet-like data structure containing a collection of columns, each of which can have a different type (numeric, string, boolean, etc.). Unlike a Series, a DataFrame has distinct row and column indices.
- There are many ways to create a DataFrame object (e.g., from a dictionary, a list of tuples, or a numpy ndarray).

In [13]:
cars = {'maker': ['Ford', 'Honda', 'Toyota', 'Tesla'],
        'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
        'price': [27595, 23570, 23495, 68000]}          
carData = pd.DataFrame(cars)            # creating DataFrame from dictionary
carData                              # display the table

|   | maker | model | price |
|---|---|---|---|
| 0 | Ford | Taurus | 27595 |
| 1 | Honda | Accord | 23570 |
| 2 | Toyota | Camry | 23495 |
| 3 | Tesla | Model S | 68000 |


In [14]:
print('carData.index =', carData.index)         # print the row indices
print('carData.columns =', carData.columns)     # print the column indices

carData.index = RangeIndex(start=0, stop=4, step=1)
carData.columns = Index(['maker', 'model', 'price'], dtype='object')


Inserting columns into an existing DataFrame:

In [15]:
carData2 = pd.DataFrame(cars, index = [1,2,3,4])  # change the row index
carData2['dealership'] = ['Courtesy Ford','Capital Honda','Spartan Toyota','N/A'] # new column
carData2['year'] = 2018    # add new column with same value
carData2['expensive'] = carData2['price'] > 50000 # add new boolean column
carData2                   # display table

|   | maker | model | price | dealership | year | expensive |
|---|---|---|---|---|---|---|
| 1 | Ford | Taurus | 27595 | Courtesy Ford | 2018 | False |
| 2 | Honda | Accord | 23570 | Capital Honda | 2018 | False |
| 3 | Toyota | Camry | 23495 | Spartan Toyota | 2018 | False |
| 4 | Tesla | Model S | 68000 | N/A | 2018 | True |


In [16]:
# delete the column 'year'
del carData2['year']
carData2

|   | maker | model | price | dealership | expensive |
|---|---|---|---|---|---|
| 1 | Ford | Taurus | 27595 | Courtesy Ford | False |
| 2 | Honda | Accord | 23570 | Capital Honda | False |
| 3 | Toyota | Camry | 23495 | Spartan Toyota | False |
| 4 | Tesla | Model S | 68000 | N/A | True |
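
The `del` statement modifies the DataFrame in place. As an alternative, the sketch below uses `DataFrame.drop`, which returns a new DataFrame and leaves the original untouched. The two-row table is a shortened, illustrative version of the car data.

```python
import pandas as pd

df = pd.DataFrame({'maker': ['Ford', 'Honda'],
                   'price': [27595, 23570],
                   'year': [2018, 2018]})

trimmed = df.drop(columns=['year'])   # new DataFrame without the 'year' column

print(trimmed.columns.tolist())
print(df.columns.tolist())            # the original still contains 'year'
```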


Creating DataFrame from a list of tuples.

In [17]:
tuplelist = [(2011,45.1,32.4),
             (2012,42.4,34.5),
             (2013,47.2,39.2),
             (2014,44.2,31.4),
             (2015,39.9,29.8),
             (2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = pd.DataFrame(tuplelist, columns=columnNames)
weatherData

|   | year | temp | precip |
|---|---|---|---|
| 0 | 2011 | 45.1 | 32.4 |
| 1 | 2012 | 42.4 | 34.5 |
| 2 | 2013 | 47.2 | 39.2 |
| 3 | 2014 | 44.2 | 31.4 |
| 4 | 2015 | 39.9 | 29.8 |
| 5 | 2016 | 41.5 | 36.7 |


Creating DataFrame from numpy ndarray

In [18]:
npdata = np.random.randn(5,3)  # create a 5 by 3 random matrix
columnNames = ['x1','x2','x3']
indices = pd.date_range('20220101', periods=5) # DatetimeIndex of 5 consecutive days starting at 2022-01-01
data = pd.DataFrame(npdata, columns=columnNames, index=indices)
data

|   | x1 | x2 | x3 |
|---|---|---|---|
| 2022-01-01 | -0.832605 | -0.312064 | -0.900982 |
| 2022-01-02 | -0.073197 | -1.587606 | 1.846530 |
| 2022-01-03 | -0.826297 | -0.985873 | -0.645134 |
| 2022-01-04 | 1.199016 | -2.424457 | -0.944754 |
| 2022-01-05 | 0.695053 | -1.186878 | -0.089504 |


There are many ways to access elements of a DataFrame object.

In [19]:
# accessing an entire column will return a Series object

print(data['x2']) # get column `x2`
# print(data.x2)    # same as above
print(type(data['x2'])) # which is a `Series` object

2022-01-01   -0.312064
2022-01-02   -1.587606
2022-01-03   -0.985873
2022-01-04   -2.424457
2022-01-05   -1.186878
Freq: D, Name: x2, dtype: float64
<class 'pandas.core.series.Series'>


In [20]:
# accessing an entire row will return a Series object

print('Row 3 of data table:')
print(data.iloc[2])       # returns the 3rd row of DataFrame
print(type(data.iloc[2]))

print('\nRow 3 of car data table:')
print(carData2.iloc[2])   # row contains objects of different types

Row 3 of data table:
x1   -0.826297
x2   -0.985873
x3   -0.645134
Name: 2022-01-03 00:00:00, dtype: float64
<class 'pandas.core.series.Series'>

Row 3 of car data table:
maker                 Toyota
model                  Camry
price                  23495
dealership    Spartan Toyota
expensive              False
Name: 3, dtype: object


In [21]:
# accessing a specific element of the DataFrame

print('carData2 =\n', carData2)

print('\ncarData2.iloc[1,2] =', carData2.iloc[1,2])                # retrieving second row, third column
print('carData2.loc[1,\'model\'] =', carData2.loc[1,'model'])    # retrieving the row labeled 1 (the first row), column named 'model'

# accessing a slice of the DataFrame
print('\ncarData2.iloc[1:3,1:3]=')
print(carData2.iloc[1:3,1:3])

carData2 =
     maker    model  price      dealership  expensive
1    Ford   Taurus  27595   Courtesy Ford      False
2   Honda   Accord  23570   Capital Honda      False
3  Toyota    Camry  23495  Spartan Toyota      False
4   Tesla  Model S  68000             N/A       True

carData2.iloc[1,2] = 23570
carData2.loc[1,'model'] = Taurus

carData2.iloc[1:3,1:3]=
    model  price
2  Accord  23570
3   Camry  23495
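
The difference between `.iloc` (position-based) and `.loc` (label-based) is easy to miss when the row labels are integers that do not start at 0, as in `carData2`. A minimal sketch with a reduced, illustrative copy of that table:

```python
import pandas as pd

df = pd.DataFrame({'maker': ['Ford', 'Honda', 'Toyota', 'Tesla'],
                   'price': [27595, 23570, 23495, 68000]},
                  index=[1, 2, 3, 4])

by_position = df.iloc[1, 0]      # position 1 is the second row
by_label = df.loc[1, 'maker']    # label 1 is the first row
print(by_position, by_label)

pos_slice = df.iloc[1:3]   # positions 1 and 2; the end position is excluded
lab_slice = df.loc[1:3]    # labels 1, 2 and 3; the end label is included
print(pos_slice)
print(lab_slice)
```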


In [22]:
print('carData2 =\n', carData2, '\n')

print('carData2.shape =', carData2.shape)
print('carData2.size =', carData2.size)

carData2 =
     maker    model  price      dealership  expensive
1    Ford   Taurus  27595   Courtesy Ford      False
2   Honda   Accord  23570   Capital Honda      False
3  Toyota    Camry  23495  Spartan Toyota      False
4   Tesla  Model S  68000             N/A       True 

carData2.shape = (4, 5)
carData2.size = 20


In [23]:
# selection and filtering

print('carData2 =\n', carData2, '\n')

print('carData2[carData2.price > 25000] =')  
print(carData2[carData2.price > 25000])

carData2 =
     maker    model  price      dealership  expensive
1    Ford   Taurus  27595   Courtesy Ford      False
2   Honda   Accord  23570   Capital Honda      False
3  Toyota    Camry  23495  Spartan Toyota      False
4   Tesla  Model S  68000             N/A       True 

carData2[carData2.price > 25000] =
   maker    model  price     dealership  expensive
1   Ford   Taurus  27595  Courtesy Ford      False
4  Tesla  Model S  68000            N/A       True
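
Row filters on a DataFrame can be combined and paired with a column selection. The following sketch, which rebuilds the car data without the extra columns, shows three common patterns; the thresholds are arbitrary.

```python
import pandas as pd

cars = pd.DataFrame({'maker': ['Ford', 'Honda', 'Toyota', 'Tesla'],
                     'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
                     'price': [27595, 23570, 23495, 68000]})

# Combine conditions with & and wrap each one in parentheses
mid_range = cars[(cars['price'] > 23500) & (cars['price'] < 30000)]

# Keep rows whose value appears in a given list
selected = cars[cars['maker'].isin(['Honda', 'Tesla'])]

# Filter rows and choose columns in one step with .loc
names_only = cars.loc[cars['price'] > 25000, ['maker', 'model']]

print(mid_range)
print(selected)
print(names_only)
```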


### 2.2.3 Arithmetic Operations

In [24]:
print(data)

print('\nData transpose operation: data.T')
print(data.T)    # transpose operation

print('\nAddition: data + 4')
print(data + 4)    # addition operation

print('\nMultiplication: data * 10')
print(data * 10)   # multiplication operation

                  x1        x2        x3
2022-01-01 -0.832605 -0.312064 -0.900982
2022-01-02 -0.073197 -1.587606  1.846530
2022-01-03 -0.826297 -0.985873 -0.645134
2022-01-04  1.199016 -2.424457 -0.944754
2022-01-05  0.695053 -1.186878 -0.089504

Data transpose operation: data.T
    2022-01-01  2022-01-02  2022-01-03  2022-01-04  2022-01-05
x1   -0.832605   -0.073197   -0.826297    1.199016    0.695053
x2   -0.312064   -1.587606   -0.985873   -2.424457   -1.186878
x3   -0.900982    1.846530   -0.645134   -0.944754   -0.089504

Addition: data + 4
                  x1        x2        x3
2022-01-01  3.167395  3.687936  3.099018
2022-01-02  3.926803  2.412394  5.846530
2022-01-03  3.173703  3.014127  3.354866
2022-01-04  5.199016  1.575543  3.055246
2022-01-05  4.695053  2.813122  3.910496

Multiplication: data * 10
                   x1         x2         x3
2022-01-01  -8.326050  -3.120639  -9.009824
2022-01-02  -0.731973 -15.876058  18.465305
2022-01-03  -8.262972  -9.858731  -6.451343

In [25]:
print('data =\n', data)

columnNames = ['x1','x2','x3']
data2 = pd.DataFrame(np.random.randn(5,3), columns=columnNames)
print('\ndata2 =')
print(data2)

print('\ndata + data2 = ')
print(data.add(data2))

print('\ndata * data2 = ')
print(data.mul(data2))

data =
                   x1        x2        x3
2022-01-01 -0.832605 -0.312064 -0.900982
2022-01-02 -0.073197 -1.587606  1.846530
2022-01-03 -0.826297 -0.985873 -0.645134
2022-01-04  1.199016 -2.424457 -0.944754
2022-01-05  0.695053 -1.186878 -0.089504

data2 =
         x1        x2        x3
0 -0.473075  0.721669 -1.382434
1 -0.402766  1.556885  0.320216
2  0.301673  0.985663 -1.102808
3 -0.388761 -1.114125 -0.330898
4 -0.188826  1.065434  1.498022

data + data2 = 
                     x1  x2  x3
2022-01-01 00:00:00 NaN NaN NaN
2022-01-02 00:00:00 NaN NaN NaN
2022-01-03 00:00:00 NaN NaN NaN
2022-01-04 00:00:00 NaN NaN NaN
2022-01-05 00:00:00 NaN NaN NaN
0                   NaN NaN NaN
1                   NaN NaN NaN
2                   NaN NaN NaN
3                   NaN NaN NaN
4                   NaN NaN NaN

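data * data2 = 
                     x1  x2  x3
2022-01-01 00:00:00 NaN NaN NaN
2022-01-02 00:00:00 NaN NaN NaN
2022-01-03 00:00:00 NaN NaN NaN
2022-01-04 00:00:00 NaN NaN NaN
2022-01-05 00:00:00 NaN NaN NaN
0                   NaN NaN NaN
1                   NaN NaN NaN
2                   NaN NaN NaN
3                   NaN NaN NaN
4                   NaN NaN NaN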
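
Every entry in the results above is NaN because `add` and `mul` align the two DataFrames on their row labels, and `data` (dates) and `data2` (integers 0 to 4) have no labels in common. The sketch below reproduces the effect with small, made-up frames and shows two possible ways around it. `set_axis` and `to_numpy` are existing pandas methods; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20220101', periods=3)
left = pd.DataFrame(np.ones((3, 2)), columns=['x1', 'x2'], index=idx)
right = pd.DataFrame(np.full((3, 2), 2.0), columns=['x1', 'x2'])   # default index 0..2

misaligned = left.add(right)   # no shared row labels, so every entry is NaN

# Option 1: give `right` the same row labels before adding
aligned = left.add(right.set_axis(idx, axis=0))

# Option 2: add the raw values, which ignores labels and works by position
by_position = left + right.to_numpy()

print(misaligned)
print(aligned)
```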

In [26]:
print(data.abs())    # get the absolute value for each element

print('\nMaximum value per column:')
print(data.max())    # get maximum value for each column

print('\nMinimum value per row:')
print(data.min(axis=1))    # get minimum value for each row

print('\nSum of values per column:')
print(data.sum())    # get sum of values for each column

print('\nAverage value per row:')
print(data.mean(axis=1))    # get average value for each row

print('\nCalculate max - min per column')
f = lambda x: x.max() - x.min()
print(data.apply(f))

print('\nCalculate max - min per row')
f = lambda x: x.max() - x.min()
print(data.apply(f, axis=1))

                  x1        x2        x3
2022-01-01  0.832605  0.312064  0.900982
2022-01-02  0.073197  1.587606  1.846530
2022-01-03  0.826297  0.985873  0.645134
2022-01-04  1.199016  2.424457  0.944754
2022-01-05  0.695053  1.186878  0.089504

Maximum value per column:
x1    1.199016
x2   -0.312064
x3    1.846530
dtype: float64

Minimum value per row:
2022-01-01   -0.900982
2022-01-02   -1.587606
2022-01-03   -0.985873
2022-01-04   -2.424457
2022-01-05   -1.186878
Freq: D, dtype: float64

Sum of values per column:
x1    0.161970
x2   -6.496878
x3   -0.733844
dtype: float64

Average value per row:
2022-01-01   -0.681884
2022-01-02    0.061909
2022-01-03   -0.819102
2022-01-04   -0.723398
2022-01-05   -0.193776
Freq: D, dtype: float64

Calculate max - min per column
x1    2.031621
x2    2.112393
x3    2.791285
dtype: float64

Calculate max - min per row
2022-01-01    0.588919
2022-01-02    3.434136
2022-01-03    0.340739
2022-01-04    3.623474
2022-01-05    1.881931
Freq: D, dtype: float64
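
For simple column statistics such as the range computed with `apply` above, the built-in reductions give the same result directly. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'x1': [1.0, 4.0, 2.0], 'x2': [10.0, 7.0, 8.0]})

range_apply = df.apply(lambda col: col.max() - col.min())   # same pattern as the cell above
range_direct = df.max() - df.min()                          # same result without apply
summary = df.agg(['min', 'max', 'mean'])                    # several statistics at once

print(range_direct)
print(summary)
```

`apply` remains useful when the per-column or per-row computation has no built-in equivalent.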

The `value_counts()` method can also be applied to a pandas DataFrame, where it counts each distinct combination of column values.

In [27]:
objects = {'shape': ['circle', 'square', 'square', 'square', 'circle', 'rectangle'],
           'color': ['red', 'red', 'red', 'blue', 'blue', 'blue']}

shapeData = pd.DataFrame(objects)
print('shapeData =\n', shapeData, '\n')

print('shapeData.value_counts() =\n', shapeData.value_counts().sort_values())

shapeData =
        shape color
0     circle   red
1     square   red
2     square   red
3     square  blue
4     circle  blue
5  rectangle  blue 

shapeData.value_counts() =
 shape      color
circle     blue     1
           red      1
rectangle  blue     1
square     blue     1
           red      2
dtype: int64


### 2.2.4 Data Join

The Pandas `merge` function supports several types of joins. The four most common are described in the documentation as follows:

- inner: use intersection of keys from both frames (SQL: inner join)
- outer: use union of keys from both frames (SQL: full outer join)
- left: use only keys from left frame (SQL: left outer join)
- right: use only keys from right frame (SQL: right outer join)

The default is the "inner join", which is also what the example at the end of this section uses to create the `movie_ratings` DataFrame.

It's easiest to understand the different types by looking at some simple examples:

In [28]:
A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num':[1, 2, 3]})
A

|   | color | num |
|---|---|---|
| 0 | green | 1 |
| 1 | yellow | 2 |
| 2 | red | 3 |


In [29]:
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size':['S', 'M', 'L']})
B

|   | color | size |
|---|---|---|
| 0 | green | S |
| 1 | yellow | M |
| 2 | pink | L |


In [30]:
# inner join: use intersection of keys from both frames
pd.merge(A, B, how='inner')

|   | color | num | size |
|---|---|---|---|
| 0 | green | 1 | S |
| 1 | yellow | 2 | M |


In [31]:
# outer join: use union of keys from both frames
pd.merge(A, B, how='outer')

|   | color | num | size |
|---|---|---|---|
| 0 | green | 1.0 | S |
| 1 | yellow | 2.0 | M |
| 2 | red | 3.0 | NaN |
| 3 | pink | NaN | L |


In [32]:
# left join: use only keys from left frame
pd.merge(A, B, how='left')

|   | color | num | size |
|---|---|---|---|
| 0 | green | 1 | S |
| 1 | yellow | 2 | M |
| 2 | red | 3 | NaN |


In [33]:
# right join: use only keys from right frame
pd.merge(A, B, how='right')

|   | color | num | size |
|---|---|---|---|
| 0 | green | 1.0 | S |
| 1 | yellow | 2.0 | M |
| 2 | pink | NaN | L |
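
To see which frame each row of a join came from, `merge` accepts an `indicator=True` argument that adds a `_merge` column. The sketch below applies it to the same `A` and `B` frames with an outer join; the `status` dictionary is only a convenience for inspecting the result.

```python
import pandas as pd

A = pd.DataFrame({'color': ['green', 'yellow', 'red'], 'num': [1, 2, 3]})
B = pd.DataFrame({'color': ['green', 'yellow', 'pink'], 'size': ['S', 'M', 'L']})

# _merge is 'both', 'left_only', or 'right_only' for each row
merged = pd.merge(A, B, how='outer', indicator=True)
print(merged)

# Map each key to its origin (row order may vary between pandas versions)
status = merged.set_index('color')['_merge'].astype(str).to_dict()
print(status)
```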


### Example
(From Kevin Markham's data science course)

Using the [MovieLens 100k data](http://grouplens.org/datasets/movielens/), let's create two DataFrames:

- **movies**: shows information about movies, namely a unique **movie_id** and its **title**
- **ratings**: shows the **rating** that a particular **user_id** gave to a particular **movie_id** at a particular **timestamp**

#### Movies

In [40]:
movie_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.item'
movie_cols = ['movie_id', 'title']
movies = pd.read_table(movie_url, encoding='ISO-8859-1', sep='|', header=None, names=movie_cols, usecols=[0, 1])
movies.head() # first 5 movies

|   | movie_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


#### Ratings

In [43]:
rating_url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.data'
rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(rating_url, sep='\t', header=None, names=rating_cols)
ratings.head() # first 5 ratings

|   | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |


Let's pretend that you want to examine the `ratings` DataFrame, but you want to know the title of each movie rather than its `movie_id`. The best way to accomplish this objective is by "joining" (or "merging") the DataFrames using the Pandas `merge` function:

In [48]:
movie_ratings = pd.merge(movies, ratings)
movie_ratings.head()

|   | movie_id | title | user_id | rating | timestamp |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 308 | 4 | 887736532 |
| 1 | 1 | Toy Story (1995) | 287 | 5 | 875334088 |
| 2 | 1 | Toy Story (1995) | 148 | 4 | 877019411 |
| 3 | 1 | Toy Story (1995) | 280 | 4 | 891700426 |
| 4 | 1 | Toy Story (1995) | 66 | 3 | 883601324 |


Here's what just happened:

- Pandas noticed that `movies` and `ratings` had one column in common, `movie_id`. This is the "key" on which the DataFrames will be joined.
- The first `movie_id` in `movies` is `1`. Thus, Pandas looked through every row in the `ratings` DataFrame, searching for a `movie_id` of 1. Every time it found such a row, it recorded the `user_id`, `rating`, and `timestamp` listed in that row. In this case, it found 452 matching rows.
- The second `movie_id` in `movies` is `2`. Again, Pandas did a search of ratings and found 131 matching rows.
- This process was repeated for all of the remaining rows in movies.

At the end of the process, the `movie_ratings` DataFrame is created, which contains the two columns from `movies` (`movie_id` and `title`) and the three other columns from `ratings` (`user_id`, `rating`, and `timestamp`).

- `movie_id` `1` and its title are listed 452 times, next to the `user_id`, `rating`, and `timestamp` for each of the 452 matching ratings.
- `movie_id` `2` and its title are listed 131 times, next to the `user_id`, `rating`, and `timestamp` for each of the 131 matching ratings.
- And so on, for every movie in the dataset.

In [49]:
print(movies.shape)
print(ratings.shape)
print(movie_ratings.shape)

(1682, 2)
(100000, 4)
(100000, 5)


Notice the shapes of the three DataFrames:

- There are 1682 rows in the `movies` DataFrame.
- There are 100000 rows in the `ratings` DataFrame.
- The `merge` function resulted in a `movie_ratings` DataFrame with 100000 rows, because every row from `ratings` matched a row from `movies`.
- The `movie_ratings` DataFrame has 5 columns, namely the 2 columns from `movies`, plus the 4 columns from `ratings`, minus the 1 column in common.

By default, the `merge` function joins the DataFrames using all column names that are in common (**`movie_id`**, in this case). The [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) explains how you can override this behavior.
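
As a rough illustration of overriding the default key selection, the sketch below uses tiny stand-in versions of `movies` and `ratings` rather than the downloaded files; the rating for movie 2 is made up. It names the key explicitly with `on`, and uses `left_on` and `right_on` for the case where the key columns have different names. The `item_id` column name is hypothetical.

```python
import pandas as pd

# Small stand-in frames (not the full MovieLens data)
movies = pd.DataFrame({'movie_id': [1, 2],
                       'title': ['Toy Story (1995)', 'GoldenEye (1995)']})
ratings = pd.DataFrame({'user_id': [308, 287, 5],
                        'movie_id': [1, 1, 2],
                        'rating': [4, 5, 3]})

# Name the join key explicitly instead of relying on shared column names
explicit = pd.merge(movies, ratings, on='movie_id')

# When the key columns are named differently, specify one per frame
ratings_renamed = ratings.rename(columns={'movie_id': 'item_id'})
different_names = pd.merge(movies, ratings_renamed,
                           left_on='movie_id', right_on='item_id')

print(explicit)
print(different_names)
```

With `left_on` and `right_on`, both key columns are kept in the result, so one of them usually needs to be dropped afterwards.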