Pandas Crash Course Prepared by Berk Varol & OpenAI ChatGPT
-------------------------------------




Pandas is a powerful and popular open-source Python library for data analysis and manipulation. It is built on top of the NumPy library and provides easy-to-use data structures and data analysis tools for working with tabular data, such as data stored in CSV or Excel files.

Here is a brief crash course on using Pandas:

First, you need to install the library by running pip install pandas in your command line or terminal.

To use Pandas in your Python code, you need to import it:

In [1]:
import pandas as pd

The two main data structures in Pandas are the DataFrame and the Series.

A DataFrame is a table, similar to an Excel spreadsheet, with rows and columns. 

A Series is a one-dimensional labeled array. Each column of a DataFrame is a Series, and selecting a single row also returns a Series.



CREATING A DATAFRAME
-------------------------------------------------------------------------------------------
To create a DataFrame, you can use the pd.DataFrame() constructor, which takes in a dictionary of lists (or other iterable objects) as its argument.

The keys of the dictionary become the column names of the DataFrame, and each list supplies the values of that column, one value per row. Here is an example:

In [2]:
data = {
  'name': ['John', 'Jane', 'Bob', 'Alice'],
  'age': [24, 45, 35, 33],
  'gender': ['M', 'F', 'M', 'F']
}

df = pd.DataFrame(data)

This creates a DataFrame with three columns ('name', 'age', 'gender') and four rows (one for each person in the data).
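
As a quick check of the description above, the following sketch rebuilds the same DataFrame and inspects its shape, column names, and column types. It only uses the standard shape, columns, and dtypes attributes.

```python
import pandas as pd

data = {
    'name': ['John', 'Jane', 'Bob', 'Alice'],
    'age': [24, 45, 35, 33],
    'gender': ['M', 'F', 'M', 'F'],
}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names come from the dictionary keys
print(df.dtypes)         # one data type per column
```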

SERIES OBTAINED FROM DATAFRAMES
---------------------------

To access a specific column of a DataFrame, you can use the [] operator and the column name:

In [5]:
names = df['name']
print(names)

0     John
1     Jane
2      Bob
3    Alice
Name: name, dtype: object


This returns a Series object containing the values in the 'name' column of the DataFrame.

To access a specific row of a DataFrame, you can use the loc[] indexer with the index label of the row:

In [9]:
first_row = df.loc[0]
print(first_row)

name      John
age         24
gender       M
Name: 0, dtype: object


This returns a Series object containing the values in the first row of the DataFrame.

You can also use the iloc[] indexer to access rows by their integer position in the DataFrame (instead of by their index label):

In [8]:
first_row = df.iloc[0]
print(first_row)

name      John
age         24
gender       M
Name: 0, dtype: object
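
With the default index (0, 1, 2, ...) loc[] and iloc[] happen to return the same row, which hides the difference between them. The sketch below uses hypothetical string labels ('a', 'b', 'c'), chosen only for illustration, to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['John', 'Jane', 'Bob'], 'age': [24, 45, 35]},
    index=['a', 'b', 'c'],  # hypothetical row labels for illustration
)

by_label = df.loc['b']    # the row whose index label is 'b'
by_position = df.iloc[1]  # the second row, counting from 0

print(by_label)
print(by_position)
```

Both lookups select Jane's row here, but only because the label 'b' sits at position 1.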


To select a subset of rows and columns from a DataFrame, you can pass a row selection and a column selection to loc[] or iloc[]. Here is an example that selects the 'name' and 'age' columns for the rows labeled 0 through 2. Note that loc[] slices include the end label, so :2 returns three rows:

In [14]:
subset = df.loc[:2, ['name', 'age']]
print(subset)

   name  age
0  John   24
1  Jane   45
2   Bob   35
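
The slice in the example above returns three rows because loc[] includes the end label. A minimal sketch comparing it with iloc[], which follows the usual Python convention of excluding the end position:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice'],
    'age': [24, 45, 35, 33],
    'gender': ['M', 'F', 'M', 'F'],
})

label_slice = df.loc[:2, ['name', 'age']]  # labels 0, 1 and 2: three rows
position_slice = df.iloc[:2, [0, 1]]       # positions 0 and 1: two rows

print(len(label_slice), len(position_slice))
```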


You can also use boolean indexing to select rows from a DataFrame based on a condition. For example, the following code selects all rows where the 'age' column is greater than or equal to 35:

In [13]:
selected_rows = df[df['age'] >= 35]
print(selected_rows)

   name  age gender
1  Jane   45      F
2   Bob   35      M
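
Conditions can also be combined. The sketch below assumes the same people DataFrame as before and uses & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Jane', 'Bob', 'Alice'],
    'age': [24, 45, 35, 33],
    'gender': ['M', 'F', 'M', 'F'],
})

# Rows where age is at least 35 AND gender is 'M'
older_men = df[(df['age'] >= 35) & (df['gender'] == 'M')]

# Rows where age is under 30 OR gender is 'F'
young_or_female = df[(df['age'] < 30) | (df['gender'] == 'F')]

print(older_men)
print(young_or_female)
```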












ELEMENT-WISE OPERATIONS IN PANDAS
--------------------------


Pandas provides several ways to perform element-wise operations on data in a DataFrame or Series.

One way is to use the apply() method. On a Series, it applies a function to each element; on a DataFrame, it passes each column (or each row, with axis=1) to the function. Here is an example that applies the len() function to each element in a Series of strings:



In [15]:

# Create a Series of strings
s = pd.Series(['apple', 'banana', 'cherry', 'durian'])

# Apply the len() function to each element in the Series
s_lengths = s.apply(len)

# Print the resulting Series
print(s_lengths)

0    5
1    6
2    6
3    6
dtype: int64


Another way to perform element-wise operations is to use the map() method, which applies a function to each element of a Series and returns a new Series containing the transformed values. With a plain function it behaves much like apply() on a Series; in addition, map() accepts a dictionary or another Series and uses it to look up a replacement for each value. Here is an example that uses the map() method to convert the strings in a Series to uppercase:

In [16]:
# Create a Series of strings
s = pd.Series(['apple', 'banana', 'cherry', 'durian'])

# Convert the strings to uppercase using the map() method
s_upper = s.map(str.upper)

# Print the resulting Series
print(s_upper)

0     APPLE
1    BANANA
2    CHERRY
3    DURIAN
dtype: object
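
Besides functions, map() accepts a dictionary and uses it as a lookup table. The following sketch uses made-up codes and labels; values that have no entry in the dictionary become NaN.

```python
import pandas as pd

# Hypothetical codes; 'X' deliberately has no entry in the lookup table
codes = pd.Series(['M', 'F', 'M', 'X'])

labels = codes.map({'M': 'male', 'F': 'female'})

print(labels)
```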


Another way to perform element-wise operations is to use the applymap() method, which applies a function to each individual element of a DataFrame, whereas apply() on a DataFrame passes a whole column or row to the function. Note that applymap() is deprecated as of pandas 2.1 in favor of the equivalent DataFrame.map() method. Here is an example that uses the applymap() method to round the values in a DataFrame to two decimal places:

In [20]:
# Create a DataFrame
df = pd.DataFrame({
  'A': [1.234, 4.556, 7.895],
  'B': [2.354, 5.675, 8.905],
  'C': [3.455, 6.785, 9.015]
})

# Round the values in the DataFrame to two decimal places
df_rounded = df.applymap(lambda x: round(x, 2))

# Print the resulting DataFrame
print(df_rounded)

      A     B     C
0  1.23  2.35  3.46
1  4.56  5.67  6.79
2  7.89  8.90  9.02
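
Because newer pandas versions (2.1 and later) provide DataFrame.map() as the replacement for applymap(), code that has to run on both old and new versions can pick whichever method is available. This is only a sketch of one way to do that; the name elementwise is a hypothetical helper variable.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.234, 4.556], 'B': [2.354, 5.675]})

# Use DataFrame.map when this pandas version has it, otherwise fall back to applymap
if hasattr(pd.DataFrame, 'map'):
    elementwise = df.map
else:
    elementwise = df.applymap

df_rounded = elementwise(lambda x: round(x, 2))
print(df_rounded)
```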


FINDING MAXIMUM OF A ROW/COLUMN (we will use this a lot)
-------------------------

To find the maximum value in each row or column of a DataFrame in Pandas, you can use the max() method along with the axis parameter, which specifies the direction of the calculation:
axis=0 returns the maximum of each column (this is the default)
axis=1 returns the maximum of each row

Here is an example that uses the max() method to find the maximum value in each row of a DataFrame:

In [25]:
# Create a DataFrame
df = pd.DataFrame({
  'A': [1, 4, 7],
  'B': [2, 5, 8],
  'C': [3, 6, 9]
})

# Find the maximum value in each row
max_values = df.max(axis=1)  # axis=0 would find the maximum of each column instead

# Print the resulting Series
print(max_values)

0    3
1    6
2    9
dtype: int64


Alternatively, you can use the idxmax() method to find the column label of the maximum value in each row. Here is an example that uses the idxmax() method to find the column labels of the maximum value in each row:

In [26]:
# Create a DataFrame
df = pd.DataFrame({
  'A': [1, 4, 7],
  'B': [2, 5, 8],
  'C': [3, 6, 9]
})

# Find the column labels of the maximum value in each row
max_labels = df.idxmax(axis=1)

# Print the resulting Series
print(max_labels)

0    C
1    C
2    C
dtype: object


Note that by default (axis=0) max() returns the maximum value of each column, and idxmax() returns the row label at which that maximum occurs. To get the maximum value of each row and the column label at which it occurs, as in the examples above, set the axis parameter to 1.
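
In the examples above the largest value always sits in column C, so the effect of axis is easy to miss. The sketch below uses made-up numbers in which the maxima are spread across rows and columns.

```python
import pandas as pd

# Hypothetical values chosen so the maxima fall in different places
df = pd.DataFrame({
    'A': [1, 9, 7],
    'B': [8, 5, 2],
    'C': [3, 6, 4],
})

col_max = df.max()                # axis=0 (default): maximum of each column
col_max_rows = df.idxmax()        # row label where each column peaks
row_max = df.max(axis=1)          # maximum of each row
row_max_cols = df.idxmax(axis=1)  # column label where each row peaks

print(col_max, col_max_rows, row_max, row_max_cols, sep='\n\n')
```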

FINDING MIN OF A ROW/COLUMN
---------------------------


To find the minimum value in each row or column of a DataFrame in Pandas, you can use the min() method along with the axis parameter, which specifies whether to find the minimum of each column (axis=0) or of each row (axis=1).

Here is an example that uses the min() method to find the minimum value in each row of the DataFrame defined above:

In [32]:
# Find the minimum value in each row
min_values = df.min(axis=1)

# Print the resulting Series
print(min_values)

0    1
1    4
2    7
dtype: int64


Alternatively, you can use the idxmin() method to find the column label of the minimum value in each row. Here is an example that uses the idxmin() method to find the column labels of the minimum value in each row:

In [34]:
# Find the column labels of the minimum value in each row
min_labels = df.idxmin(axis=1)

# Print the resulting Series
print(min_labels)

0    A
1    A
2    A
dtype: object


Note that by default (axis=0) min() returns the minimum value of each column, and idxmin() returns the row label at which that minimum occurs. To get the minimum value of each row and the column label at which it occurs, as in the examples above, set the axis parameter to 1.
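
A common use of idxmin() is to fetch the whole row in which one column reaches its minimum. The sketch below uses made-up values and passes the label returned by idxmin() to loc[].

```python
import pandas as pd

# Hypothetical values for illustration
df = pd.DataFrame({
    'A': [4, 1, 7],
    'B': [5, 8, 2],
    'C': [9, 6, 3],
})

col_min = df.min()              # default axis=0: minimum of each column
min_b_label = df['B'].idxmin()  # row label where column B is smallest
row_of_min_b = df.loc[min_b_label]

print(col_min)
print(row_of_min_b)
```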


AVERAGE OF A COLUMN OR ROW 
--------------------------

To find the average of the values in a column of a DataFrame in Pandas, you can use the mean() method along with the axis parameter, which specifies whether to find the average along the rows or columns.

Here is an example that uses the mean() method to find the average of the values in each column of a DataFrame:

In [28]:
# Create a DataFrame
df = pd.DataFrame({
  'A': [1, 4, 7],
  'B': [2, 5, 8],
  'C': [3, 6, 9]
})

# Find the average of the values in each column
avg_values = df.mean(axis=0)

# Print the resulting Series
print(avg_values)

A    4.0
B    5.0
C    6.0
dtype: float64


In [35]:

# Find the average of the values in each ROW
avg_values = df.mean(axis=1)

# Print the resulting Series
print(avg_values)

0    2.0
1    5.0
2    8.0
dtype: float64


NUMERICAL OPERATIONS BETWEEN ROWS & COLUMNS (we will use this a lot as well)
-----------------------------

To perform numerical operations between the values in different rows of a DataFrame in Pandas, you can use the apply() method along with the axis parameter, which specifies whether to apply the function to the rows or columns of the DataFrame.

Here is an example that uses the apply() method to subtract the values in the first row of a DataFrame from the values in the second row, and also to sum the values in each row:

In [47]:
# Create a DataFrame
df = pd.DataFrame({
  'A': [1, 4, 7],
  'B': [2, 5, 8],
  'C': [3, 6, 9]
})

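# Subtract the values in the first row from the values in the second row.
# With axis=0 the function receives one column at a time.
differences = df.apply(lambda col: col.loc[1] - col.loc[0], axis=0)  # element-by-element operations work like this

# Sum the values in each row; with axis=1 the function receives one row at a time
sums = df.apply(lambda row: sum(row), axis=1)  # you can also sum this way

# Print the resulting Series
print(sums)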

0     6
1    15
2    24
dtype: int64


Note that the apply() method passes each column to the function by default (axis=0). If you want to pass each row to the function instead, set the axis parameter to 1.

Another way to perform numerical operations between the values in different rows of a DataFrame is to use the sub(), mul(), div(), and other arithmetic methods, which perform element-wise arithmetic between a DataFrame and another DataFrame, a Series, or a scalar.
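
As a sketch of that alternative, the row operations from the apply() example can be written without a lambda. Here df.loc[0] is the first row as a Series, and axis=1 in sub() matches its labels against the column names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [3, 6, 9],
})

# Second row minus first row, column by column
row_diff = df.loc[1] - df.loc[0]

# Subtract the first row from every row of the DataFrame
shifted = df.sub(df.loc[0], axis=1)

# Sum of each row, without apply()
row_sums = df.sum(axis=1)

print(row_diff)
print(shifted)
print(row_sums)
```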

MULTIPLYING OR DIVIDING EACH ROW/COLUMN BY A VARIABLE
---------------------------

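To multiply or divide each row or column of a DataFrame in Pandas by a variable, you can use the mul() or div() methods, respectively. When the variable is a single number, every element is scaled and the axis parameter has no effect. The axis parameter matters when the other operand is a Series or list: it specifies whether its values are matched against the row index (axis=0) or the column labels (axis=1).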

Here is an example that uses the mul() method to multiply each row of a DataFrame by a variable:

In [43]:
# Create a DataFrame
df = pd.DataFrame({
  'A': [1, 4, 7],
  'B': [2, 5, 8],
  'C': [3, 6, 9]
})

# Multiply each row of the DataFrame by 2
df_mul = df.mul(2, axis=1)

# Print the resulting DataFrame
print(df_mul)

    A   B   C
0   2   4   6
1   8  10  12
2  14  16  18


Similarly, you can use the div() method to divide each row of a DataFrame by a variable:

In [44]:
# Divide each row of the DataFrame by 2
df_div = df.div(2, axis=1)

# Print the resulting DataFrame
print(df_div)

     A    B    C
0  0.5  1.0  1.5
1  2.0  2.5  3.0
2  3.5  4.0  4.5


Note that the mul() and div() methods default to axis=1 ('columns'), which matches a Series operand against the column labels. If you want to match it against the row index instead, set the axis parameter to 0. With a single number such as 2, both settings give the same result.
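
The axis parameter only changes the result when the multiplier has one value per column or per row. The sketch below uses two made-up Series of factors to show both directions.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [3, 6, 9],
})

# Hypothetical factors: one per column and one per row
col_factors = pd.Series([1, 10, 100], index=['A', 'B', 'C'])
row_factors = pd.Series([1, 2, 3], index=[0, 1, 2])

scaled_cols = df.mul(col_factors, axis=1)  # each column gets its own factor
scaled_rows = df.mul(row_factors, axis=0)  # each row gets its own factor

print(scaled_cols)
print(scaled_rows)
```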

Another way to multiply or divide each row or column of a DataFrame by a variable is to use the * or / operators, respectively. Writing df * 2 scales the whole DataFrame directly; the following code instead applies the * operator to each row through apply(), which gives the same result:

In [48]:
# Multiply each row of the DataFrame by 2
df_mul = df.apply(lambda row: row * 2, axis=1)

# Print the resulting DataFrame
print(df_mul)

    A   B   C
0   2   4   6
1   8  10  12
2  14  16  18
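
For a single number, the operators can also be applied to the whole DataFrame at once, with no apply() call. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [3, 6, 9],
})

df_times_two = df * 2  # same result as df.mul(2)
df_halved = df / 2     # same result as df.div(2)

print(df_times_two)
print(df_halved)
```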


HOW TO USE .XLSX FILES AS DATAFRAMES
--------------------------


To use an Excel file as a DataFrame in Pandas, you can use the read_excel() function of the pandas library. This function takes the Excel file as its input and returns a DataFrame that you can use to perform various operations, such as selecting, filtering, sorting, and more. Reading .xlsx files also requires an Excel engine such as openpyxl to be installed (pip install openpyxl).

Here is an example that shows how to use the read_excel() method to create a DataFrame from an Excel file:



In [None]:
import pandas as pd

# Read the Excel file into a DataFrame
df = pd.read_excel('data.xlsx')

# Print the first 5 rows of the DataFrame
print(df.head())

This code assumes that the Excel file is called data.xlsx and is located in the same directory as the Python script. If the file is located in a different directory, you can pass the full path to the file to read_excel().

Note that read_excel() reads the first sheet of the Excel file by default. If you want to read a different sheet from the file, you can pass the sheet name or its zero-based position as the sheet_name parameter. For example, the following code reads the second sheet of an Excel file and prints the first 5 rows of the resulting DataFrame:

In [None]:
# Read the second sheet of the Excel file into a DataFrame
df = pd.read_excel('data.xlsx', sheet_name=1)

# Print the first 5 rows of the DataFrame
print(df.head())

If you want to read multiple sheets from an Excel file, you can use the 'pandas.ExcelFile' class, which provides a way to iterate over the sheets of an Excel file. Here is an example that shows how to use the ExcelFile class to read all the sheets of an Excel file and print the first 5 rows of each sheet:

In [None]:
# Open the Excel file
excel_file = pd.ExcelFile('data.xlsx')

# Iterate over the sheets of the Excel file
for sheet_name in excel_file.sheet_names:
  # Read the current sheet into a DataFrame
  df = excel_file.parse(sheet_name)

  # Print the first 5 rows of the DataFrame
  print(df.head())
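
As an alternative sketch to the loop above, read_excel() can load every sheet in one call when sheet_name=None is passed; it then returns a dictionary that maps each sheet name to its DataFrame. The helper name read_all_sheets and the file name data.xlsx are assumptions for illustration.

```python
import pandas as pd


def read_all_sheets(path):
    """Return a dict mapping each sheet name to its DataFrame."""
    # sheet_name=None asks read_excel for every sheet in the workbook
    return pd.read_excel(path, sheet_name=None)


# Hypothetical usage, assuming data.xlsx exists next to the script:
# sheets = read_all_sheets('data.xlsx')
# for name, sheet_df in sheets.items():
#     print(name, sheet_df.shape)
```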