# Task 3. Answering student questions

How would you answer the student's question below? Your task is to get your message across in such a way that a beginner can understand your explanation. You can do this any way you want (pictures, GIFs, metaphors, anything) so long as it makes your explanation clear. Indicate how much time you spent completing this task.

### > **What is the difference between DataFrame and Series?**


A: Simply put, a DataFrame is a *two-dimensional structure* made by **combining** one or more Series that share the same index, like an Excel spreadsheet with rows and columns.

We can think of a Series as a one-dimensional array of values, usually of a single data type (integers, strings, booleans, etc.), in which each value is associated with a label in an index.

Let's look at an example by importing the pandas library (the name comes from "panel data") and loading the event_data.csv dataset.

In [1]:
import pandas as pd

In [20]:
df = pd.read_csv('event_data.csv', nrows=500)  # nrows=500 reads only the first 500 rows of the dataset for a faster import

Each column (user_id, event_date, event_type, purchase_amount) is a Series, and together they make up the DataFrame.

In [3]:
df

Unnamed: 0,user_id,event_date,event_type,purchase_amount
0,c40e6a,2019-07-29 00:02:15,registration,
1,a2b682,2019-07-29 00:04:46,registration,
2,9ac888,2019-07-29 00:13:22,registration,
3,93ff22,2019-07-29 00:16:47,registration,
4,65ef85,2019-07-29 00:19:23,registration,
...,...,...,...,...
495,18ee16,2019-07-30 09:28:27,registration,
496,188815,2019-07-30 09:29:07,registration,
497,f14c37,2019-07-30 09:29:59,registration,
498,370fd7,2019-07-30 09:30:55,registration,
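
To see the same relationship on a smaller scale, here is a minimal sketch that builds a DataFrame by hand from two Series. The names `names`, `ages`, and `people`, and the values in them, are made up for illustration and are not part of event_data.csv.

```python
import pandas as pd

# Two Series with the same default index (0, 1, 2); values are made up
names = pd.Series(["Ana", "Ben", "Cleo"], name="name")
ages = pd.Series([31, 25, 40], name="age")

# Placing the Series side by side as columns gives a DataFrame
people = pd.concat([names, ages], axis=1)

print(type(names).__name__)          # Series
print(type(people).__name__)         # DataFrame
print(people.shape)                  # (3, 2): 3 rows, 2 columns
print(type(people["age"]).__name__)  # Series: one column of a DataFrame
```

Each Series becomes one column, and the Series name becomes the column name.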


By selecting a single column, we can see that a Series also has an index that labels each value; by default it is simply the position of the value in the array.

In [7]:
df['user_id']

0      c40e6a
1      a2b682
2      9ac888
3      93ff22
4      65ef85
        ...  
495    18ee16
496    188815
497    f14c37
498    370fd7
499    d52f25
Name: user_id, Length: 500, dtype: object

This means that we can access a value in the Series by its index label, which here is the same as its position.

In [14]:
df['user_id'][0]

'c40e6a'
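
Labels and positions only coincide because this DataFrame uses the default index (0, 1, 2, ...). As a small sketch of the difference, the Series `s` below is a hypothetical example with custom labels; `.loc` selects by label and `.iloc` selects by position.

```python
import pandas as pd

# Hypothetical Series whose labels (10, 20, 30) differ from the positions (0, 1, 2)
s = pd.Series(["c40e6a", "a2b682", "9ac888"], index=[10, 20, 30])

print(s.loc[10])   # select by label    -> c40e6a
print(s.iloc[0])   # select by position -> c40e6a
print(s.iloc[-1])  # last position      -> 9ac888
```

With an index like this, `s[0]` would look for the label 0 rather than the first element, so using `.loc` or `.iloc` explicitly makes the intent clear.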

We can also perform string, mathematical, and logical operations on a Series.

In [11]:
df['user_id'].str.upper()

0      C40E6A
1      A2B682
2      9AC888
3      93FF22
4      65EF85
        ...  
495    18EE16
496    188815
497    F14C37
498    370FD7
499    D52F25
Name: user_id, Length: 500, dtype: object

However, we cannot apply this operation to the whole DataFrame at once. We first have to specify which column (or Series) we want to apply it to.

In [12]:
df.str.upper()

AttributeError: 'DataFrame' object has no attribute 'str'
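
One way around this error is to run the string operation column by column. The sketch below uses a small made-up frame, `df_small`, whose column names mirror the dataset above; the values, including the purchase amount, are invented for illustration.

```python
import pandas as pd

# Small made-up frame; column names mirror the example above
df_small = pd.DataFrame({
    "user_id": ["c40e6a", "a2b682"],
    "event_type": ["registration", "registration"],
    "purchase_amount": [None, 12.5],  # invented values, numeric column
})

# Option 1: upper-case one column through its .str accessor
df_small["user_id"] = df_small["user_id"].str.upper()

# Option 2: apply the same Series operation to several text columns
text_cols = ["user_id", "event_type"]
df_small[text_cols] = df_small[text_cols].apply(lambda col: col.str.upper())

print(df_small)
```

`DataFrame.apply` passes each selected column to the function as a Series, which is why `.str` is available inside the lambda. The numeric column is left out of `text_cols` because `.str` only works on text data.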

# Task 4.

You are given two random variables X and Y

E(X) = 0.5, Var(X)=2

E(Y) = 7, Var(Y) = 3.5

cov(X,Y) = 0.8

Find the variance of the random variable Z = 2X - 3Y


The variance of a linear combination *aX + bY* is given by the formula:

`Var(aX + bY) = a²*Var(X) + b²*Var(Y) + 2ab*cov(X, Y)`
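
For Z = 2X - 3Y we have a = 2 and b = -3. Substituting these and the given values by hand:

$$
\begin{aligned}
\mathrm{Var}(Z) &= 2^2 \cdot 2 + (-3)^2 \cdot 3.5 + 2 \cdot 2 \cdot (-3) \cdot 0.8 \\
&= 8 + 31.5 - 9.6 \\
&= 29.9
\end{aligned}
$$

Note that the expected values E(X) and E(Y) are not needed for the variance.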

Calculating with Python:

In [15]:
a = 2
b = -3
var_X = 2
var_Y = 3.5
cov_XY = 0.8

In [16]:
var_Z = (a**2) * var_X + (b**2) * var_Y + 2 * a * b * cov_XY

In [19]:
print(f'The variance of Z is {var_Z}')

The variance of Z is 29.9
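
As a cross-check, the same formula can be written in matrix form, Var(Z) = wᵀΣw, where w holds the coefficients and Σ is the covariance matrix of X and Y. This is a minimal sketch using NumPy; the names `w`, `cov_matrix`, and `var_Z_matrix` are introduced here for illustration.

```python
import numpy as np

w = np.array([2.0, -3.0])  # coefficients of X and Y in Z = 2X - 3Y
cov_matrix = np.array([
    [2.0, 0.8],            # Var(X),    cov(X, Y)
    [0.8, 3.5],            # cov(X, Y), Var(Y)
])

# Quadratic form w^T * Sigma * w expands to a²Var(X) + b²Var(Y) + 2ab*cov(X, Y)
var_Z_matrix = w @ cov_matrix @ w
print(round(var_Z_matrix, 1))  # 29.9
```

The matrix form gives the same value as the step-by-step calculation and extends directly to linear combinations of more than two variables.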
