# Python for Psychologists - Session 3

## handling data with dataframes & pandas

### handling data 

In the last two sessions you learned about the basic principles, data types and variables, and how to handle them ... but **most** of the time we do not work with just a single list or tuple, but with a bunch of data arranged in logfiles, tables, .csv files ..


Today we learn about using **pandas** to ... 

![pandasUrl](https://media.giphy.com/media/fAaBpMgGuyf96/giphy.gif "pandas")



... well to actually handle our data. **Pandas** is your data's home and we can get familiar with our data by cleaning, transforming and analyzing it through pandas.

To get started, we need to `import pandas as pd` so that we can use all the features it provides. We use ***pd*** as an abbreviation, since we are a bit lazy here :)

In [None]:
import pandas as pd 

Pandas has two core components, i.e., ***series* and *dataframes***. A series is basically a single (labeled)
column, whereas a dataframe is a two-dimensional table made out of a collection of series. Both can contain different kinds of data types - for now we will use integers ..

----------
**creating series**

to create a series with any `element`, we can use:

```python
s = pd.Series([element1, element2, element3], name="anynameyouwant")
```

Now try to create two series, each named after one of your favorite fruits and holding 6 random integers, and check one of them:


In [None]:
s1 = pd.Series([3,4,7,8,4,1], name="apples")

s2 = pd.Series([5,9,12,2,9,10], name="bananas")

s1

As we can see, there is one column (as described above) containing the assigned values, but wait .. why is there another column? 

The first column contains the index. In our case we just used the pandas default, which again starts at 0 (remember why?). Consequently, we can again use ```series[1]``` to index the second value (row) in our series.
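
To make the index idea concrete, here is a minimal sketch using the apples series from the cell above. The comparison with ```.iloc``` is an addition for illustration and not part of the exercise:

```python
import pandas as pd

s1 = pd.Series([3, 4, 7, 8, 4, 1], name="apples")

print(s1.index)    # the default index, counting up from 0
print(s1[1])       # label 1 is the second row
print(s1.iloc[1])  # position-based access, independent of the labels
```

With the default index, label and position happen to coincide, which is why both calls return the same value here.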


----------
**create dataframes from scratch**

Usually in data analysis we somehow end up with a .csv file from our experiment, but firstly we will learn how to create dataframes from scratch. There are many different ways and this notebook is certainly not exhaustive:

- we can use a dictionary to combine our two fruit series s1 and s2 into a dataframe "shoppinglist" by using the ```pd.DataFrame(some_data)``` constructor. Here each (key: value) pair corresponds to a column:

In [None]:
fruits= {...} # first we need to arrange our series in a dictionary 

shoppinglist = ... # pd.DataFrame(data) conveniently builds a nice looking dataframe for us

shoppinglist # show our shoppinglist 
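
One possible way to fill in the cell above, sketched under the assumption that the dictionary keys should simply repeat the series names (any other column names would work just as well):

```python
import pandas as pd

s1 = pd.Series([3, 4, 7, 8, 4, 1], name="apples")
s2 = pd.Series([5, 9, 12, 2, 9, 10], name="bananas")

# each key becomes a column name, each value becomes that column's data
fruits = {"apples": s1, "bananas": s2}

shoppinglist = pd.DataFrame(fruits)
print(shoppinglist)
```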

- another way to combine two series into a dataframe is ```pd.concat([seriesA, seriesB])```, which concatenates your series. Let's try to recreate the result displayed above:

In [None]:
pd.concat([...])

Oops, something went wrong! Do you have an idea what happened? 



**KEEP IN MIND!** 
(pandas) functions do have a default setting, which might sometimes behave differently than expected.

Remember? By checking ```pd.concat?``` in a code cell we see that the default option for concatenating two objects is along axis=0, i.e. along the rows! However, we want to recreate the nice looking dataframe above, which means we need to concatenate the objects along the column axis (i.e., axis=1) and specify this explicitly. Let's see whether this works:

In [None]:
shoppinglist = pd.concat([s1,s2], ...)
shoppinglist
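
The difference between the two axes can be sketched as follows. The variable names ```stacked``` and ```side_by_side``` are made up for this illustration:

```python
import pandas as pd

s1 = pd.Series([3, 4, 7, 8, 4, 1], name="apples")
s2 = pd.Series([5, 9, 12, 2, 9, 10], name="bananas")

# default axis=0: the second series is glued below the first one
stacked = pd.concat([s1, s2])

# axis=1: each series becomes its own column, named after the series
side_by_side = pd.concat([s1, s2], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

Comparing the two shapes is a quick way to check which axis was used.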

Right now, we are still using the pandas default for our index (i.e., numbers). Let's say we want to use customer names as an index:

```python
dataframe.set_index([list_of_anything_with_equal_length_to_dataframe])
```

Let's create a list of 6 customers and replace the current indices with this list to see how many fruits each of them is buying at the Wochenmarkt:

In [None]:
customer=["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]

In [None]:
shoppinglist = ....
shoppinglist

btw: if you want to check how long your dataframe is (i.e. how many rows it has), just use ```len(dataframe)``` - pretty easy, huh?
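
A minimal sketch of swapping the index, assuming the two-column shoppinglist from above. Wrapping the names in ```pd.Index``` is a choice made here to make explicit that they are row labels rather than column names; it is only one of several ways to do this:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4, 7, 8, 4, 1], "bananas": [5, 9, 12, 2, 9, 10]}
)
customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]

# set_index returns a new dataframe, so we assign the result back
shoppinglist = shoppinglist.set_index(pd.Index(customer))

print(shoppinglist)
print(len(shoppinglist))  # number of rows
```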

**Adding columns and rows**

*Columns*

The Wochenmarkt is about to close and all our customers are thrilled by all the last-minute sale offers. All of them are about to buy some plums.

Again, many roads lead to Rome and we will just cover some of them:

- declare a ```pd.Series``` of equal length that is to be converted into a column and assign it with ```dataframe["new_column_name"] = some_series```


In [None]:
s3 = .... ## does not work if indices do not correspond

...

shoppinglist

Since a series also carries its own index (if we don't define it, pandas will use its default!), that index needs to correspond to the index of our dataframe. Otherwise pandas finds no matching labels and fills the new column with missing values (i.e. **N**ot **a** **N**umber, NaN values).

- this also works with lists and might be a little bit more convenient: ```dataframe["new_column_name"] = [some_list_with_equal_length]```. Since lists do not have an index, the values are simply assigned in order

Try to add a new column "lemon" with random values for each customer!

In [None]:
shoppinglist["lemon"] = ...
shoppinglist
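
The index matching described above can be sketched like this. The plum and lemon counts are made up for the illustration:

```python
import pandas as pd

customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]
shoppinglist = pd.DataFrame(
    {"apples": [3, 4, 7, 8, 4, 1], "bananas": [5, 9, 12, 2, 9, 10]},
    index=customer,
)

# a series with the default 0..5 index shares no labels with the customers
shoppinglist["plums"] = pd.Series([2, 0, 5, 1, 3, 4])
all_missing = shoppinglist["plums"].isna().all()

# giving the series the dataframe's index makes the values line up
shoppinglist["plums"] = pd.Series([2, 0, 5, 1, 3, 4], index=shoppinglist.index)

# a plain list has no index, so it is assigned by position
shoppinglist["lemon"] = [1, 2, 0, 2, 6, 3]

print(all_missing)
print(shoppinglist)
```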

- if you want more flexibility, you could also use ```dataframe.insert``` to add a list of values as a new column at a specific position, just like this:

```python
dataframe.insert(position, "column_name", [some_list], allow_duplicates=True)
## omitting allow_duplicates=True raises a ValueError when your column name
## already exists in your dataframe; with it, you get a second column of the same name
```
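
A short sketch of ```insert``` in action. The "oranges" column and its values are hypothetical; note that ```insert``` changes the dataframe directly and returns nothing:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4, 7, 8, 4, 1], "bananas": [5, 9, 12, 2, 9, 10]}
)

# position 1 puts the new column between apples and bananas
shoppinglist.insert(1, "oranges", [6, 0, 2, 3, 1, 5])
print(list(shoppinglist.columns))

# inserting an existing name again is refused without allow_duplicates=True
try:
    shoppinglist.insert(0, "oranges", [1, 1, 1, 1, 1, 1])
    refused = False
except ValueError as err:
    refused = True
    print("refused:", err)
```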



*Rows*

Oh hey there, we just met Norbert, who is currently doing a smoothie-detox treatment, and do you know what? He also likes apples, bananas, oranges, plums and lemons a lot! Let's add him to our little dataframe!

Again, we can use ```pd.DataFrame``` to create a new, single-row dataframe for Norbert that contains a value for each of our fruits. To combine our two dataframes, the column names in both dataframes need to be identical!

```python

new_dataframe = pd.DataFrame([some_list_with_one_value_per_column], columns=old_dataframe.columns)  # outer list = rows


```

Try to create a new single-row dataframe called norbert that contains a value for each fruit and uses the column names of our shoppinglist dataframe!

We can get our column labels with ```dataframe.columns``` and our index labels with ```dataframe.index```

In [None]:
norbert = ...
norbert
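
A sketch of a single-row dataframe, using a shortened shoppinglist with two customers and made-up values for Norbert. The point to notice is the double bracket: the outer list holds the rows, the inner list holds one value per column:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4], "bananas": [5, 9], "plums": [2, 0], "lemon": [1, 2]},
    index=["Victoria", "Rhonda"],
)

# reuse the existing column labels so both dataframes match
norbert = pd.DataFrame([[6, 3, 8, 4]], columns=shoppinglist.columns)

print(norbert)
```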

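Let's add Norbert to our shoppinglist dataframe! You are already familiar with ```.append``` for adding new elements to a list. Older pandas versions offered a dataframe method of the same name, which (unlike the list method) returned a new dataframe instead of changing the old one:
```python

dataframe.append(new_dataframe)  # pandas < 2.0 only, removed in pandas 2.0

```

With pandas 2.0 or newer this call raises an AttributeError, so use ```pd.concat``` (see below) instead. Let's add Norbert to our dataframe and check our new dataframe!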

In [None]:
shoppinglist = ...
shoppinglist

------- 

We already learned at the beginning of this session that we can use 

```pd.concat([element1, element2])``` 


for combining two elements. We can use the same command to combine our two dataframes! Keep in mind that the axis decides where the new dataframe goes: adding a row means concatenating along the rows, which is the default (axis=0)
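
Adding a row with ```pd.concat``` can be sketched as follows, again with a shortened shoppinglist and made-up values for Norbert:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4], "bananas": [5, 9], "plums": [2, 0], "lemon": [1, 2]},
    index=["Victoria", "Rhonda"],
)
norbert = pd.DataFrame([[6, 3, 8, 4]], columns=shoppinglist.columns)

# axis=0 is the default, so the new row is added below the existing ones
shoppinglist = pd.concat([shoppinglist, norbert])

print(shoppinglist)
```

The new row keeps the default index label of the single-row dataframe, so it shows up as 0 rather than as a name.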

------- 

**renaming**

What a pity! We forgot to update our index - Norbert's name is missing - let's better change that before he gets any identity issues!

Do you have an idea how to solve this issue? You essentially already know all the commands to beat the riddler! 

One could 
- update our customer list
- set our index or assign the updated list to dataframe.index
- check our dataframe

- we can also use ```dataframe.rename(index={"old_value": "new_value"}, inplace=True)``` (or ```columns={...}``` for column names) to solve the issue in one single line of code. Setting ```inplace=True``` applies the modification directly to our dataframe. If we stick to the default (i.e. ```False```), we need to assign the result back, as in ```dataframe = dataframe.rename(...)```, to "save" our modifications

Let's try to change one of your customer names:

In [None]:
...

In [None]:
shoppinglist
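
Both flavours of ```rename``` in one sketch, assuming a shortened shoppinglist in which Norbert's row still carries the leftover label 0. Renaming the "lemon" column is a hypothetical extra to show the ```columns``` argument:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4, 6], "bananas": [5, 9, 3], "lemon": [1, 2, 4]},
    index=["Victoria", "Rhonda", 0],
)

# default inplace=False: rename returns a new dataframe that we assign back
shoppinglist = shoppinglist.rename(index={0: "Norbert"})

# inplace=True: the existing dataframe is modified directly
shoppinglist.rename(columns={"lemon": "lemons"}, inplace=True)

print(shoppinglist)
```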

Besides adding and renaming stuff in our dataframe, we can also delete rows or columns by using ```drop```. Like ```rename```, it returns a new dataframe by default:
```python

dataframe.drop(index=["element1","element2"])
dataframe.drop(columns=["element1","element2"])
```
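
A small sketch of both variants on a shortened shoppinglist. The name ```without_bananas``` is made up for the illustration:

```python
import pandas as pd

shoppinglist = pd.DataFrame(
    {"apples": [3, 4, 7], "bananas": [5, 9, 12], "lemon": [1, 2, 0]},
    index=["Victoria", "Rhonda", "Elli"],
)

# drop a row by its index label and keep the result
shoppinglist = shoppinglist.drop(index=["Victoria"])

# drop a column; the original dataframe stays untouched
without_bananas = shoppinglist.drop(columns=["bananas"])

print(shoppinglist)
print(without_bananas)
```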



Try to delete the first customer in your list:

In [None]:
shoppinglist = shoppinglist....
shoppinglist

**indexing**

We already know from previous sessions that we can use indexing to access the first element of a list, the third letter of a string and so on ... in our dataframe universe we can do just the same

*indexing columns or rows*

- the easiest way to index a column is by using ```dataframe["column"]``` for one column and ```dataframe[["column1", "column2"]]``` for two columns.

Try to index your last two columns:

In [None]:
shoppinglist[...]

When the index operator ```[]``` is passed a single label (a str or int), it attempts to find a column with this particular name and returns it as a series ... however, if we pass a **slice** to the operator, it changes its behavior and selects rows instead. We can do this with *int* as well as *str* slices! Note that int slices go by position and exclude the end, whereas str (label) slices include it.
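
The switch between column and row selection can be sketched like this, assuming a shoppinglist with three fruit columns (the plum counts are made up):

```python
import pandas as pd

customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]
shoppinglist = pd.DataFrame(
    {
        "apples": [3, 4, 7, 8, 4, 1],
        "bananas": [5, 9, 12, 2, 9, 10],
        "plums": [2, 0, 5, 1, 3, 4],
    },
    index=customer,
)

one_column = shoppinglist["apples"]               # single label: a series
two_columns = shoppinglist[["bananas", "plums"]]  # list of labels: a dataframe
middle_rows = shoppinglist[1:-1]                  # int slice: rows by position
named_rows = shoppinglist["Rhonda":"Lucie"]       # str slice: rows by label

print(middle_rows)
print(named_rows)
```

In this example both slices select the same four customers, because the label slice includes its end point while the int slice does not.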

Try to index all rows except the first and last one by using an "int-slicing":

In [None]:
shoppinglist[...]

As the simple index operator ```[]``` is not that flexible, we will get to know two other approaches to index rows and columns today:

- selecting rows and columns by **number** using ```dataframe.iloc[row_selection,column_selection]```

Try to only select the first two rows and all columns:



In [None]:
shoppinglist.... # ":" refers to "all"

Try to select a subset of rows and columns: 

In [None]:
shoppinglist.....
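
A sketch of position-based selection with ```.iloc```. The plum counts and the particular rows and columns chosen are only an example:

```python
import pandas as pd

customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]
shoppinglist = pd.DataFrame(
    {
        "apples": [3, 4, 7, 8, 4, 1],
        "bananas": [5, 9, 12, 2, 9, 10],
        "plums": [2, 0, 5, 1, 3, 4],
    },
    index=customer,
)

# first two rows, all columns (":" refers to "all")
first_two = shoppinglist.iloc[:2, :]

# rows at positions 1 to 3, first and third column
subset = shoppinglist.iloc[1:4, [0, 2]]

print(first_two)
print(subset)
```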

- selecting rows and columns by **label** (index and column names), or selecting rows with booleans, using ```dataframe.loc[row_selection,column_selection]```

Try to select two rows by using the (customer) index:


In [None]:
shoppinglist...

Try to select three customers and two columns of your choice!

In [None]:
shoppinglist...
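
Label-based selection with ```.loc``` can be sketched in the same way. Which customers and columns are picked is arbitrary here:

```python
import pandas as pd

customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]
shoppinglist = pd.DataFrame(
    {
        "apples": [3, 4, 7, 8, 4, 1],
        "bananas": [5, 9, 12, 2, 9, 10],
        "plums": [2, 0, 5, 1, 3, 4],
    },
    index=customer,
)

# two rows by customer name, all columns
two_customers = shoppinglist.loc[["Rhonda", "Isa"], :]

# three rows and two columns, all addressed by their labels
selection = shoppinglist.loc[["Victoria", "Elli", "Lucie"], ["apples", "bananas"]]

print(two_customers)
print(selection)
```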

Let's imagine that you are particularly interested in customers that bought more than 8 bananas or exactly 2 lemons. Such questions and row selections can easily be handled with conditional selections using booleans in ```dataframe.loc[selection]```. Remember what booleans are about?

If we want to select only those customers who bought fewer than 8 bananas:

In [None]:
shoppinglist.loc[...]

Let's see how this works: if we use ```dataframe["column"] == some_value``` we get a **pandas Series** with True or False for each of our rows:

In [None]:
...

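You can also combine two or more conditional statements using ```dataframe.loc[(conditional) & (conditional)]``` for "and", or ```|``` instead of ```&``` for "or". The parentheses around each condition are required: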

In [None]:
shoppinglist.loc[...]
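
The boolean selections from this section can be sketched as follows. The lemon counts and the variable names (```mask```, ```both```, ```either```) are made up for the illustration:

```python
import pandas as pd

customer = ["Victoria", "Rhonda", "Elli", "Rebecca", "Lucie", "Isa"]
shoppinglist = pd.DataFrame(
    {"bananas": [5, 9, 12, 2, 9, 10], "lemon": [1, 2, 0, 2, 6, 3]},
    index=customer,
)

# a comparison yields a boolean series, one True/False per row
mask = shoppinglist["bananas"] < 8
few_bananas = shoppinglist.loc[mask]

# "&" keeps rows where both conditions hold, "|" where at least one holds
both = shoppinglist.loc[(shoppinglist["bananas"] > 8) & (shoppinglist["lemon"] == 2)]
either = shoppinglist.loc[(shoppinglist["bananas"] > 8) | (shoppinglist["lemon"] == 2)]

print(mask)
print(few_bananas)
print(both)
print(either)
```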