# Introduction to Python for Data Analytics

## Python as calculator

In [None]:
2 * 3

In [None]:
10 + 2 - 1 * 3

In [None]:
3 ** 2

## Comments

In [None]:
# A line that begins with "#" is a comment: Python ignores it when running the code, so it produces no output

## Connect Google Colab to Google Drive

First of all, we need to set up the infrastructure. Let's connect Google Colab to Google Drive. We need that for uploading files.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Variables and Data Types

We can store numbers, results of calculations, or all kinds of other values in variables.

Let's look at a small example from the University of St. Gallen's gift shop:

In [None]:
revenue = 10000
profit = 1500
margin = profit / revenue
print(margin)

### Integers

Integers are whole numbers like 1, -2, or 3

In [None]:
type(revenue)

### Floats

Floats are real numbers with an integer part and a fractional part, for example 1.23 or -6.32.

In [None]:
type(margin)

### Strings

Text is stored as a string variable. String values are enclosed in quotation marks, either " " or ' '.

In [None]:
text = "123"
type(text)

### Booleans

Booleans are variables that can only take the values True or False.

In [None]:
x=True
type(x)

# Python Lists

So far, each of our variables has stored a single value. Lists are "containers" provided by Python that can store several values.

Let's have a look at the products of our small shop:

In [None]:
Shirts = 500 
Hoodies = 200
Bags = 100

We can create a list with square brackets []:

In [None]:
items = [Shirts, Hoodies, Bags]
print(items)

Lists can handle all types of data (and also other lists)

In [None]:
items = ["Shirts", Shirts, "Hoodies", Hoodies, "Bags", Bags, "Shoes", 200]
print(items)

## Filtering from Lists (Subsetting)

## Selecting single Elements

We can access a single value from a list by putting its index number in the [ ].

In principle **listname[n]** should return the nth value from the list. However, Python uses **zero-based indexing**, i.e., it **starts counting with 0**.

Thus, listname[n] returns the (n+1)th element of the list: listname[0] is the first element and listname[1] is the second.



In [None]:
print(items)

In [None]:
# get first element
print(items[0])

# get second element
print(items[1])

# get last item
print(items[-1])
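
As a minimal sketch of zero-based and negative indexing, the example below uses a hypothetical list `products` (invented for illustration, not part of the shop data above). It shows that index 2 points to the third element and that `-1` is a shortcut for the last position.

```python
# Hypothetical list, used only for illustration
products = ["Shirts", "Hoodies", "Bags", "Shoes"]

third = products[2]                      # index 2 is the third element
last = products[-1]                      # negative indices count from the end
same_last = products[len(products) - 1]  # equivalent to products[-1]

print(third)
print(last == same_last)
```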

## Multiple items 

We can also select "slices" of elements. 

listname **[0:2]** selects the first and the second element of our list. The start index is inclusive and the end index is exclusive:

**[ inclusive : exclusive ]**

In [None]:
# get the first two elements
items[0:2]

# get the first four items
items[:4]

# get the third element and everything beyond
items[2:]
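
The inclusive/exclusive rule can be checked on a small list. This is only a sketch with a hypothetical list `products` (an assumption for illustration). It also shows a handy consequence of the rule: a slice `[a:b]` contains `b - a` elements.

```python
# Hypothetical list, used only for illustration
products = ["Shirts", "Hoodies", "Bags", "Shoes", "Cups"]

first_two = products[0:2]  # indices 0 and 1; index 2 is excluded
middle = products[1:4]     # indices 1, 2, and 3; index 4 is excluded
rest = products[2:]        # index 2 up to the end of the list

print(first_two)
print(middle)
print(len(middle) == 4 - 1)  # a slice [a:b] holds b - a elements
```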

## Update Lists

In [None]:
print(items)

# update the first element of a list
items[0] = "T-Shirts"
print(items)

# delete items of a list
del(items[-1])
print(items)

# combine different lists
items = items + ["200", "Coffee Cups", "123"]
print(items)

# Functions and Libraries

Python offers a multitude of built-in functions. A function is a piece of code, written by someone else, that carries out a specific task. Using such functions, we can re-use that code and do not have to write it ourselves!

We have already worked with functions: **type()** and **print()**.

Other built-in python functions that are handy for changing data types are
- bool()
- float()
- int()
- str() 

A function usually consists of two parts. A **"function name"** (e.g., "str") and one or more inputs that are called **arguments** (e.g., "data"):

**function_name(argument)**

In [None]:
print(revenue)
print(type(revenue))

revenue_str = str(revenue)
print(revenue_str)
print(type(revenue_str))
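
The other conversion functions work the same way. The short sketch below uses made-up values (all variable names are hypothetical) to show what each conversion returns, including two cases that often surprise beginners: `int()` cuts off the fractional part instead of rounding, and `bool()` treats every non-empty string as True.

```python
as_int = int("123")        # string -> integer 123
as_float = float("0.15")   # string -> float 0.15
as_str = str(1500)         # integer -> string "1500"
truncated = int(2.9)       # float -> integer; the fraction is cut off, not rounded
empty_is_false = bool("")  # an empty string converts to False
zero_is_false = bool(0)    # the number 0 converts to False
text_is_true = bool("0")   # any non-empty string converts to True, even "0"

print(as_int + 1)
print(truncated)
print(text_is_true)
```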

## Libraries

A library is a collection of functions contributed by the Python open-source community that we can use in our own code.

## Pandas

One of the most important libraries for data analysis in python is called **pandas**. 

Pandas can be used to
- load data sets into Python
- manipulate data and perform standard mathematical operations on the data
- provide data structures that organize data in rows and columns (as in typical spreadsheet software). "Spreadsheets" in Python are called **DataFrames**
- perform basic visualization

Libraries need to be imported before they can be used in Python. We can easily import pandas with the following command:

In [None]:
import pandas as pd
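
To get a feeling for what a DataFrame is, here is a minimal sketch that builds one by hand from a dictionary of lists. The table `shop` and its figures are invented for illustration and are not the data set used below.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
shop = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],  # each key becomes a column
    "Units": [500, 200, 100],                  # each list holds that column's values
})

print(shop)
print(shop.shape)  # (number of rows, number of columns)
```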

## Import Data

In [None]:
df = pd.read_excel("https://github.com/casbdai/datasets/raw/main/Module2/Onboarding/Examples/hsgshop.xlsx", sheet_name=0)

## Methods

Methods are similar to functions: they are prepackaged code snippets that carry out a specific task. However, they are directly associated with a Python object (e.g., the DataFrame "df" that we have just created). Methods are called on a given object and applied to the data within that object / DataFrame.

The logic for using a method is always

**objectname.methodname()**

In [None]:
df.head()

In [None]:
df.info()
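
The objectname.methodname() pattern is not specific to DataFrames. As a small sketch with made-up values (the variable names are hypothetical), strings and lists have methods too. Some methods return a new object, while others change the object they are called on.

```python
name = "hsg shop"
shout = name.upper()     # string method: returns a new upper-case string
words = name.split(" ")  # string method: returns a list of words

stock = [500, 200]
stock.append(100)        # list method: changes the list in place

print(shout)
print(words)
print(stock)
```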

Pandas provides all kinds of summary statistics for performing exploratory data analysis:

In [None]:
df.mean(numeric_only=True) # get the mean of each numeric column; numeric_only skips text columns such as "Product"
df.corr(numeric_only=True) # returns the correlation between the numeric columns in a data frame
df.count() # returns the number of non-null values in each data frame column
df.std(numeric_only=True) # returns the standard deviation of each numeric column
df.median(numeric_only=True) # returns the median of each numeric column
df.max() # returns the highest value in each column
df.min() # returns the lowest value in each column
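
As a hedged illustration, the sketch below runs two of these statistics on a tiny hypothetical table `sales` (the numbers are invented, only the column names mirror the shop data). It also uses `describe()`, a pandas method that reports several summary statistics at once and, by default, only covers the numeric columns.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],
    "Revenue 2018": [20000, 30000, 10000],
    "Revenue 2019": [25000, 27000, 12000],
})

means = sales.mean(numeric_only=True)  # the text column "Product" is skipped
summary = sales.describe()             # count, mean, std, min, quartiles, max

print(means)
print(summary)
```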


## Adding and deleting new Features

In [None]:
df["Total Revenue"] =  df["Revenue 2019"] + df["Revenue 2018"]
df["Revenue Growth"] = (df["Revenue 2019"] / df["Revenue 2018"])-1

df.head()

In [None]:
del(df["Total Revenue"])

df.head()
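
The same steps can be traced on a small hypothetical table `sales` (invented numbers, for illustration only). The sketch also shows `drop(columns=...)`, an alternative to `del` that returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],
    "Revenue 2018": [20000, 30000, 10000],
    "Revenue 2019": [25000, 27000, 12000],
})

# New column: the calculation is applied row by row
sales["Revenue Growth"] = sales["Revenue 2019"] / sales["Revenue 2018"] - 1

# drop() returns a copy without the column; sales itself still has it
without_growth = sales.drop(columns=["Revenue Growth"])

print(sales)
print(without_growth.columns.tolist())
```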

# Selecting Data in Data Frames

## Integer-location based indexing (iloc)

### Selecting rows

In [None]:
df.iloc[0] # get the first row of a data frame

In [None]:
df.iloc[[0,1,3,5]] # select rows 1, 2, 4, and 6 (positions 0, 1, 3, and 5)

In [None]:
df.iloc[0:5] # get the first five rows (positions 0 to 4)

### Selecting columns

In [None]:
df.iloc[:,0] # selects the first column; ":" indicates to use all rows

In [None]:
df.iloc[:,[1,3]] # select 2nd and 4th columns

In [None]:
df.iloc[:,0:2] # select the first two columns

### Selecting rows and columns

In [None]:
df.iloc[0:4,[0,3]] # get the first four rows and the 1st and 4th columns

### Summary

Using iloc, we can select rows and columns of our DataFrame directly by their integer positions.

The logic is always: 

**df.iloc [ "row selection" , "column selection" ]**
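
A minimal sketch of the row/column logic, using a hypothetical table `sales` with invented numbers. A single position returns one row or one cell, while slices and lists return a smaller DataFrame.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],
    "Revenue 2018": [20000, 30000, 10000],
    "Revenue 2019": [25000, 27000, 12000],
})

first_row = sales.iloc[0]        # first row, returned as a Series
cell = sales.iloc[1, 2]          # row position 1, column position 2
block = sales.iloc[0:2, [0, 2]]  # rows 0 and 1, columns 0 and 2

print(first_row)
print(cell)
print(block)
```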

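## Label-based indexing (loc)

Selecting data with .loc works much like .iloc. The main difference is that .loc uses labels, so columns are selected by name and rows by their index label. Note that slices in .loc include the end label: with the default integer index, 0:3 returns the rows labeled 0, 1, 2, and 3.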



In [None]:
df.head()

In [None]:
df.loc[0:3,["Revenue 2018", "Revenue 2019"]] 
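
One detail is easy to miss: a slice in .loc includes its end label, whereas a slice in .iloc excludes its end position. The sketch below shows the difference on a hypothetical table `sales` (invented numbers) that uses the default integer index.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],
    "Revenue 2018": [20000, 30000, 10000],
})

by_label = sales.loc[0:1, ["Revenue 2018"]]  # labels 0 and 1, both included
by_position = sales.iloc[0:1, [1]]           # position 0 only; position 1 is excluded

print(len(by_label))
print(len(by_position))
```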

### Boolean Indexing 

Using .loc, we can select data based on certain conditions. This is called boolean indexing.

Some comparison operators that return a Boolean value are:

- greater than   >
- less than      <
- equals         ==
- does not equal !=


In [None]:
print(5 > 3)
print(5 == 3)
print(5 != 3)

In [None]:
df.loc[df["Revenue 2018"] > 25000,]
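
It can help to look at the condition on its own. In this sketch, which uses a hypothetical table `sales` with invented numbers, the comparison produces one True or False per row (the variable name `mask` is just an illustrative choice), and .loc keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],
    "Revenue 2018": [20000, 30000, 10000],
})

mask = sales["Revenue 2018"] > 15000  # one True/False value per row
selected = sales.loc[mask]            # keeps only the rows where mask is True

print(mask)
print(selected)
```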

Conditions can be combined with the AND (&) and OR (|) operators:

In [None]:
(5 > 3) & (5 == 3) # AND-Operators return True if all conditions are True

In [None]:
(5 > 3) | (5 == 3) # OR-Operators return True if one of the conditions is True

In [None]:
df.loc[ (df["Revenue Growth"] > 0) & (df["Product"] != "Sweatshirts") & (df["Product"] != "Sweetshirts"),]
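
When combining conditions, each one needs its own parentheses, because & and | bind more tightly than comparisons such as > or !=. The sketch below uses a hypothetical table `sales` (invented numbers) and stores each condition in a variable first. As one possible alternative to chaining several != comparisons, it uses the pandas method `isin()` together with `~`, which negates a condition.

```python
import pandas as pd

# Hypothetical figures, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Shirts", "Hoodies", "Bags"],
    "Revenue Growth": [0.25, -0.10, 0.20],
})

growing = sales["Revenue Growth"] > 0            # True for Shirts and Bags
not_excluded = ~sales["Product"].isin(["Bags"])  # True for everything except Bags

result = sales.loc[growing & not_excluded]       # both conditions must be True

print(result)
```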

## Basic Plotting with Pandas

In [None]:
df.hist(column=["Revenue Growth", "Revenue 2019"]) # create a histogram

In [None]:
df.plot(kind="bar", x="Product", y="Revenue 2019")

In [None]:
df.plot(kind="line", x="Product", y="Revenue 2019")

In [None]:
df.plot(kind="scatter", x="Revenue 2018", y="Revenue 2019")