### Series in pandas

A Series is a one-dimensional labeled array: every value is paired with an index label.

In [None]:
import pandas as pd

In [None]:
#create a Series from a list
l = [0,1,2,3,4,5]
ser = pd.Series(l)
ser

In [None]:
#some additional info
print(type(ser))
print(ser.shape)

In [None]:
rangeSeries = pd.Series(range(-3, 4))
print(rangeSeries)

In [None]:
#this sets the indexes to the strings defined
names = ['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG']
ser = pd.Series(l, index = names)
ser

In [None]:
#looking up data becomes easier now
print(ser['LAT'])

In [None]:
a = ser > 3
print(a)

In [None]:
#what's going on here?
print(ser[a])

### From dictionary to Series

A dictionary can be passed directly to __pd.Series__. Note that the keys of the dictionary become the index of the Series:

In [None]:
# simple dictionary
data = {'NYT':0, 'WP':1, 'LAT':2, 'CNN':3, 'BBC':4, 'TG':5}

# Convert the dictionary into a pd.Series, and view it
media = pd.Series(data)
media

In [None]:
media.index = ['New York Times', 'Washington Post', 'Los Angeles Times', 'Cable News Network', 'British Broadcasting Company', 'The Guardian']

In [None]:
media

### Dataframes in pandas

DataFrames can be treated like tables, and there are a number of ways to create them:


In [None]:
#if the lists are of equal length
data = {'medium': ['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG'], 
        'articles': [2000, 8000, 3000, 500, 12000, 1000], 
        'reporters': [25, 76, 30, 10, 100, 15]}  # numbers are fictional
df = pd.DataFrame(data)
df
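
A dictionary of equal-length lists is only one of the possible inputs. As a minimal sketch (the names `records`, `rows`, `df_records` and `df_rows` are made up for this illustration), the same table could also be built from a list of row dictionaries or from a list of rows plus explicit column names:

```python
import pandas as pd

# Option 1: a list of records, one dictionary per row
records = [
    {'medium': 'NYT', 'articles': 2000, 'reporters': 25},
    {'medium': 'WP', 'articles': 8000, 'reporters': 76},
]
df_records = pd.DataFrame(records)

# Option 2: a list of rows, with the column names passed separately
rows = [['NYT', 2000, 25], ['WP', 8000, 76]]
df_rows = pd.DataFrame(rows, columns=['medium', 'articles', 'reporters'])

print(df_records)
print(df_rows)
```

Both variants should yield the same two-row table; which one is more convenient depends on how the raw data arrives.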

The column order can be set explicitly through the __columns__ argument when the DataFrame is built from the dictionary:

In [None]:
df_new = pd.DataFrame(data, columns=['reporters', 'articles', 'medium'])
df_new

In [None]:
#adding a new column
df_new['long name'] = ['New York Times', 'Washington Post', 'Los Angeles Times', 'Cable News Network', 'British Broadcasting Company', 'The Guardian']

In [None]:
#deleting a column
del df_new['medium']
df_new

In [None]:
#transpose
df_new.T

In [None]:
df_new.T.T

### Indexing in Pandas

The __iloc__ indexer selects parts of a DataFrame by integer position. It behaves differently depending on whether we pass a slice or a single row and column position.

In [None]:
df_new

In [None]:
res = df_new.iloc[-2:]
res

In [None]:
res = df_new.iloc[0,2]
res

As we can see, passing the slice __[-2:]__ extracts the last two rows (more generally, __[0:n]__ extracts the first n rows, at positions 0 to n-1), and using __[0,2]__ accesses the third element in the first row.

<div class="alert alert-block alert-warning">
<b>Note:</b> Notice the difference in data types - in the former case we retrieve a DataFrame as a result while in the latter case we have a string!
</div>
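
The difference in return types can be checked directly. The following is a small sketch on a made-up three-row frame (the name `demo` is not part of the notebook's data):

```python
import pandas as pd

# Small stand-in frame with the same column layout as df_new
demo = pd.DataFrame({
    'reporters': [25, 76, 30],
    'articles': [2000, 8000, 3000],
    'long name': ['New York Times', 'Washington Post', 'Los Angeles Times'],
})

rows = demo.iloc[-2:]    # slice of rows -> DataFrame
row = demo.iloc[0]       # one row position -> Series
cell = demo.iloc[0, 2]   # row and column position -> single value (here a str)

print(type(rows).__name__, type(row).__name__, type(cell).__name__)
```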

We can also use this to change values in the DataFrame directly:

In [None]:
df_new.iloc[0,2] = 'The New York Times'
df_new

In [None]:
#df_new.loc['2000']

### Extracting Series from Pandas DataFrames

Using __iloc__ we are able to extract Series from DataFrames. The following two cells demonstrate this:

In [None]:
ser = df_new.iloc[:,2]
ser

In [None]:
ser2 = df_new.iloc[1:4,2]
ser2

We can extract a row from a DataFrame using the __loc__ indexer and passing along a label, which needs to match a label in the index of the DataFrame (here the default integer index):

In [None]:
res_1 = df_new.loc[1]
res_1

This results in a Series and can be used to extract new values, e.g.:

In [None]:
res_1['reporters']

Passing along a list of labels works too, but this results in a DataFrame:

In [None]:
res_2 = df_new.loc[[1,2]]
res_2

We can also pass ranges to __loc__. Unlike __iloc__, which excludes the end position, __loc__ includes the end label:

In [None]:
resdf = df_new.iloc[1:4]

In [None]:
resdf

In [None]:
res_3 = df_new.loc[1:4]
res_3
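
The two slices above do not return the same rows. A minimal sketch on a made-up single-column frame (`demo` is a hypothetical name) makes the difference explicit:

```python
import pandas as pd

# Six rows with the default integer index 0..5
demo = pd.DataFrame({'reporters': [25, 76, 30, 10, 100, 15]})

by_position = demo.iloc[1:4]  # positions 1, 2, 3: the end position is excluded
by_label = demo.loc[1:4]      # labels 1, 2, 3, 4: the end label is included

print(list(by_position.index))
print(list(by_label.index))
```

With a default integer index, positions and labels look identical, which makes this off-by-one difference easy to overlook.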

And this also works for the columns:

In [None]:
res_4 = df_new.loc[1:4, 'reporters':'articles']
res_4

Passing a list that contains a single label also results in a DataFrame:

In [None]:
res_5 = df_new.loc[[1]]
res_5

Sometimes, we want to change the index of our DataFrame - this can be done like this:

In [None]:
res_3.set_index('long name', inplace=True)
res_3.loc['Washington Post':]#todo

In [None]:
res_3
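
As a hedged alternative to `inplace=True`, `set_index` can also return a new DataFrame and leave the original untouched. The sketch below uses a made-up two-row frame (`demo`, `indexed` and `restored` are illustrative names):

```python
import pandas as pd

demo = pd.DataFrame({
    'reporters': [76, 30],
    'long name': ['Washington Post', 'Los Angeles Times'],
})

# Without inplace=True, set_index returns a new DataFrame
indexed = demo.set_index('long name')

# Rows can now be looked up by their label in the new index
print(indexed.loc['Washington Post', 'reporters'])

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())
```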

### Logical indexing in DataFrames

With the help of [logical indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) we can retrieve data from our DataFrames more easily. For example, we can pass an array or Series of True/False values to the __.loc__ indexer to select exactly those rows where the value is __True__.

In [None]:
#let's create some larger artificial data for the sake of simplicity
datalarge = {'medium': ['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG','NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG'], 
        'articles': [2000, 8000, 3000, 500, 12000, 1000, 3000, 8000, 5000, 500, 1000, 1000], 
        'reporters': [25, 76, 30, 10, 100, 15, 26, 71, 10, 10, 101, 15],
        'sections': [13, 25, 23, 22, 12, 3, 6, 10, 23 ,19, 18, 6], 
        'articles_per_week': [200, 250, 190, 222, 120, 300, 506, 116, 213 ,119, 418, 61], 
        }  # numbers are fictional
dflarge = pd.DataFrame(datalarge)
dflarge

In [None]:
dflarge.loc[dflarge['medium'] == 'NYT']

In [None]:
tmp = (dflarge['medium'] == 'NYT')
print(type(tmp))
print(tmp)

Several things are worth noting in this example:
- the __dflarge['medium'] == 'NYT'__ statement returns a Pandas Series of True/False values
- we can use this Series as input for the __loc__ statement to retrieve all the rows where our statement matches the desired result
- in our case we were looking for all rows containing NYT as the medium

In [None]:
#this works also for the larger than / smaller than operator
dflarge.loc[dflarge['articles'] > 2000]

In [None]:
#again, if you want to filter out specific columns, use the second argument
dflarge.loc[dflarge['articles'] > 2000, ['medium','articles']]
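
Conditions like these can also be combined. The sketch below, on a made-up four-row frame (`demo` is a hypothetical name), joins two boolean Series with `&` and negates the result with `~`:

```python
import pandas as pd

demo = pd.DataFrame({
    'medium': ['NYT', 'WP', 'LAT', 'NYT'],
    'articles': [2000, 8000, 3000, 3000],
})

# Each comparison yields a boolean Series. The parentheses are required
# because & binds more tightly than == and >.
mask = (demo['medium'] == 'NYT') & (demo['articles'] > 2000)
selected = demo.loc[mask]

# ~ inverts the mask and selects all remaining rows
others = demo.loc[~mask]

print(selected)
print(others)
```

Note that Python's `and` / `or` keywords do not work element-wise on Series; `&` and `|` are the operators to use here.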

In [None]:
df_new

In [None]:
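#Setting the values in a column using the .loc indexer
#df_new still has its default integer index, so the row is selected with a boolean mask
df_new.loc[df_new['long name'] == 'Washington Post', 'reporters'] = 80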
df_new

### Tasks
Have a look at the task descriptions below. For each task, use a new cell and solve the task described there. Use the data from this DataFrame and make use of the __.loc__ indexer in each task:

In [None]:
dflarge

#### Task 1:
Select rows with medium LAT and all columns between 'articles' and 'sections'

In [None]:
dflarge.loc[dflarge['medium'] == 'LAT', 'articles':'sections']

#### Task 2:
Select rows where the medium column ends with 'T'

In [None]:
dflarge.loc[dflarge['medium'].str.endswith('T')]

#### Task 3:
Select rows with medium equal to the values in this list: ['BBC', 'CNN', 'WP']

In [None]:
dflarge.loc[dflarge['medium'].isin(['BBC', 'CNN', 'WP'])] 

#### Task 4:
Select rows with medium WP and 3000 articles or more

In [None]:
dflarge.loc[(dflarge['medium'] == 'WP') & (dflarge['articles'] >= 3000)]

#### Task 5:
Select rows with an index label between 2 and 5 (inclusive), and just return the 'reporters' and 'sections' columns

In [None]:
dflarge.loc[2:5, ['reporters', 'sections']] 

#### Task 6:
Select rows where the length of the medium name is 3 letters by making use of a lambda function:

In [None]:
dflarge.loc[dflarge['medium'].apply(lambda x: len(x) == 3)]

Store the True/False selection produced by the lambda function in the cell above in a separate variable 'idx':

In [None]:
idx = dflarge['medium'].apply(lambda x: len(x) == 3)
idx

Select only the True values in 'idx' and only the 3 columns 'reporters', 'sections', 'articles_per_week':

In [None]:
dflarge.loc[idx, ['reporters', 'sections', 'articles_per_week']]
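
As an aside, the same kind of mask can be built without a lambda by using the vectorised string accessor. This is a sketch on a made-up frame (`demo`, `mask_apply` and `mask_str` are illustrative names), not part of the task solution:

```python
import pandas as pd

demo = pd.DataFrame({'medium': ['NYT', 'WP', 'LAT', 'TG']})

# Element-wise Python function, as in the task above
mask_apply = demo['medium'].apply(lambda x: len(x) == 3)

# Vectorised alternative: .str.len() computes the length of every string
mask_str = demo['medium'].str.len() == 3

print(mask_apply.tolist())
print(mask_str.tolist())
```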