# **Introduction**

[Pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

# **Series**

A [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) (including a time series) is a one-dimensional ndarray with axis labels.

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
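
As a minimal sketch of the missing-data behaviour described above, the example below compares the pandas mean with the plain NumPy mean on the same values. The Series name `prices` and its contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical Series with one missing value
prices = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

pandas_mean = prices.mean()               # NaN is skipped: (1.0 + 3.0) / 2
numpy_mean = np.mean(prices.to_numpy())   # plain NumPy propagates NaN
strict_mean = prices.mean(skipna=False)   # opt out of skipping in pandas

print(pandas_mean, numpy_mean, strict_mean)
```

Passing `skipna=False` restores the NaN-propagating behaviour when a missing value should invalidate the result.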

Operations between Series (+, -, /, *, **) align values based on their associated index values; the Series need not be the same length. The result index will be the sorted union of the two indexes.

**Creating Series**

A Series can be created from various other structures, such as lists, arrays, dicts, and scalars.

In [62]:
# import pandas
import pandas as pd

In [None]:
# series from list
my_list = ["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"] # define list

# create series
first_serie = pd.Series(my_list)
first_serie

In [64]:
# import NumPy
import numpy as np

In [None]:
# series from array
my_array = np.array(["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"])

# create series
second_serie = pd.Series(my_array)
second_serie

In [None]:
# series from dict (the keys become the index labels)
my_dict = {"Audi": "A3", "BMW": "320i", "VW": "T-Roc", "Renault": "Captur", "Citroen": "C4", "Peugeot": "3008", "Mercedes": "SLK200", "Porsche": "911"}

# create series
third_serie = pd.Series(my_dict)
third_serie

In [None]:
# series from scalar (the value is repeated for every index label)
fourth_serie = pd.Series(7, index=["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"])
fourth_serie

**Accessing Series' Elements**

Accessing an element of a Series is similar to accessing an element of an array: use the element's index, either its integer position (with the default index) or its label.

In [None]:
# accessing an element by its integer index
first_serie[0]

In [None]:
# accessing an element by its label
third_serie["Audi"]
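
The explicit `.loc` (label-based) and `.iloc` (position-based) accessors make the intent unambiguous. The sketch below uses a shortened, hypothetical version of the dict above; as far as I know, recent pandas versions warn when plain `[]` is given an integer position on a Series whose index holds strings, which is one reason to prefer `.iloc` there.

```python
import pandas as pd

# shortened, illustrative version of the dict-based Series
cars = pd.Series({"Audi": "A3", "BMW": "320i", "VW": "T-Roc"})

by_label = cars.loc["BMW"]           # label-based access
by_position = cars.iloc[0]           # position-based access
subset = cars.loc[["Audi", "VW"]]    # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```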

**Vectorizing Elements in Series**

Vectorization makes it possible to perform an operation on an entire Series at once, rather than iterating through its individual elements.

In [None]:
# creating two series
first_store = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"])
second_store = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"])

# calculating the total
total_stores = first_store + second_store
total_stores

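Let us change the order of the index labels in the second_store series. Because values are aligned by label rather than by position, the totals per car stay the same: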

In [None]:
# creating two series
first_store = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"])
second_store = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=["C4", "3008", "SLK200", "911", "A3", "320i", "T-Roc", "Captur"])

# calculating the total
total_stores = first_store + second_store
total_stores

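Let us replace some index labels in the first series. Labels that appear in only one of the two series have no partner to add to, so their result is NaN: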

In [None]:
# creating two series
first_store = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=["A3", "320i", "T-Roc", "Captur", "C4", "3008", "Juke", "Yaris"])
second_store = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], index=["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"])

# calculating the total
total_stores = first_store + second_store
total_stores
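
When an unmatched label should count as zero instead of producing NaN, `Series.add` with `fill_value` is one option. The following is a small sketch with shortened, illustrative series (`first` and `second` are not names from the cells above).

```python
import pandas as pd

# shortened stores: "Juke" exists only in first, "SLK200" only in second
first = pd.Series([1, 2, 7], index=["A3", "320i", "Juke"])
second = pd.Series([10, 20, 70], index=["A3", "320i", "SLK200"])

plain_total = first + second                     # unmatched labels give NaN
filled_total = first.add(second, fill_value=0)   # the missing side counts as 0

print(plain_total)
print(filled_total)
```

Whether zero is the right substitute depends on the data; here it models a store that simply does not stock the car.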

# **DataFrames**

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

It has labeled axes (rows and columns). Arithmetic operations align on both row and column labels, and a DataFrame can be thought of as a dict-like container for Series objects.

It is the primary pandas data structure.

In [None]:
# creating a dict of lists (each key becomes a column)
list_of_cars = {"Brand": ["Audi", "BMW", "VW", "Renault", "Citroen", "Peugeot", "Mercedes", "Porsche"],
                "Cars":["A3", "320i", "T-Roc", "Captur", "C4", "3008", "SLK200", "911"],
                "Quantity": [100, 95, 12, 45, 32, 16, 22, 34]}

# creating a DataFrame from the dict
cars_data = pd.DataFrame(list_of_cars)
cars_data
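
To illustrate the "dict-like container for Series" idea, the sketch below selects one column from a shortened, illustrative frame and checks that it is a Series. The variable names are made up for this example.

```python
import pandas as pd

# shortened version of the cars table
cars = pd.DataFrame({
    "Brand": ["Audi", "BMW", "VW"],
    "Quantity": [100, 95, 12],
})

quantity = cars["Quantity"]          # selecting one column returns a Series
column_names = list(cars.columns)    # column labels
row_labels = list(cars.index)        # default integer row labels
total = quantity.sum()               # Series methods work on the column

print(type(quantity).__name__, column_names, row_labels, total)
```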

**Creating a DataFrame from a GitHub file**

To do this, follow these three steps:

1. Click on the dataset in your repository, then click on View Raw.
2. Copy the link to the raw dataset and store it in Colab as a string variable called url or path (or any other name you like).
3. Pass the URL to the pandas read_csv function to get the DataFrame.

In [None]:
# creating a DataFrame from a CSV file on GitHub (for Excel files, pandas provides read_excel)
url = "https://raw.githubusercontent.com/carloscesarferreira/pythoncourse/main/Season%20II%20%E2%80%93%20Data%20Science%20with%20Python/datasets/FDNY_Firehouse_Listing.csv"
df = pd.read_csv(url)
df
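
`read_csv` also accepts file-like objects, which makes it possible to try the call without network access. The sketch below feeds it a tiny in-memory CSV; the column names and rows are invented for illustration and are not taken from the FDNY dataset.

```python
import io

import pandas as pd

# made-up CSV content standing in for a downloaded file
csv_text = """name,borough,postcode
Station A,Manhattan,10005
Station B,Brooklyn,11201
"""

# read_csv parses the header row and infers a dtype for each column
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.dtypes)
```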

**Manipulating the DataFrame**

A DataFrame can be inspected and manipulated in many ways, for example:

In [None]:
# read the first rows of the DataFrame
df.head() # by default, the first 5 rows

In [None]:
# read a specific number of first rows
df.head(8) # read the first 8 rows

In [None]:
# read the last rows of the DataFrame
df.tail() # by default, the last 5 rows

In [None]:
# read a specific number of last rows
df.tail(9) # read the last 9 rows

In [None]:
# counting the rows and columns in DataFrame
df.shape

In [None]:
# get summary statistics for the DataFrame's numeric columns
df.describe()
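
By default `describe` reports only numeric columns; passing `include="all"` adds text columns as well. A small sketch on an invented frame:

```python
import pandas as pd

# invented sample with one text column and one numeric column
sample = pd.DataFrame({
    "Borough": ["Bronx", "Queens", "Bronx"],
    "Postcode": [10451, 11101, 10452],
})

numeric_summary = sample.describe()            # numeric columns only
all_summary = sample.describe(include="all")   # text columns get count/unique/top/freq

print(numeric_summary)
print(all_summary)
```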

In [None]:
# sort values by a specific column (ascending or descending)
df.sort_values(by=["Census Tract"], ascending=True)

In [None]:
# sort values using multiple columns (each one ascending or descending)
df.sort_values(by=["Census Tract", "Borough"], ascending=[True, False])
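
Note that `sort_values` returns a new, sorted DataFrame and leaves the original untouched unless the result is assigned back. The sketch below shows this on an invented three-row frame, using the same mixed ascending/descending pattern.

```python
import pandas as pd

# invented sample: two rows share the same Tract value
sample = pd.DataFrame({
    "Borough": ["Queens", "Bronx", "Queens"],
    "Tract": [7, 3, 3],
})

# Tract ascending; ties broken by Borough descending
sorted_sample = sample.sort_values(by=["Tract", "Borough"], ascending=[True, False])

# the sorted copy keeps the original row labels; reset them if a fresh 0..n-1 index is wanted
renumbered = sorted_sample.reset_index(drop=True)

print(sorted_sample)
print(renumbered)
```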

In [None]:
# add a new column (one random value per row)
df["New Column"] = np.random.random(len(df))
df

In [None]:
# drop a column
df = df.drop("New Column", axis = 1) # axis number (0 for rows and 1 for columns)
df

In [None]:
# add a column in a specific position (0 = first column)
df.insert(0, "New Column 2", np.random.random(len(df)))
df

In [None]:
# drop the column added in the previous cell
df = df.drop("New Column 2", axis = 1)
df
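
The add, insert, and drop steps above can be sketched together on a small invented frame. Here the new column's length is derived from the frame itself, and `drop(columns=...)` is used as an alternative spelling of `axis=1`. The seeded generator is only there to make the sketch repeatable.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({"a": [1, 2, 3]})
rng = np.random.default_rng(seed=0)   # seeded generator, an illustrative choice

# a new column must have one value per row, so derive the length from the frame
sample["noise"] = rng.random(len(sample))

# insert modifies the frame in place and puts the column at position 0
sample.insert(0, "first", rng.random(len(sample)))
columns_after_insert = list(sample.columns)

# drop(columns=...) is equivalent to drop(..., axis=1)
sample = sample.drop(columns=["noise", "first"])

print(columns_after_insert, list(sample.columns))
```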

In [None]:
# add a row to the DataFrame (one value per column, in column order)
df.loc[219,:] = ["New Facility", "New Address", "New Borough", 999999, 10.0000, 20.0000, 999, 999, 999, 999, 999, "NTA"]
df

In [None]:
# drop a row by its index label
df.drop(219, axis = 0, inplace = True)
df
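
Another common way to append rows is `pd.concat`, which returns a new DataFrame instead of modifying the existing one. A minimal sketch with invented data:

```python
import pandas as pd

sample = pd.DataFrame({"name": ["A", "B"], "qty": [1, 2]})

# build the new row as a one-row DataFrame with matching column names
new_row = pd.DataFrame([{"name": "C", "qty": 3}])

# ignore_index=True renumbers the rows 0..n-1 in the result
extended = pd.concat([sample, new_row], ignore_index=True)

# drop(index=...) removes a row by label and returns a new DataFrame
trimmed = extended.drop(index=2)

print(extended)
print(trimmed)
```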

In [None]:
# creating a new column using mathematical operations
df["AverageLatitudeLongitude"] = (df["Latitude"]+df["Longitude"])/2
df

In [None]:
# selecting rows with a single comparison
selected_lines = df[df.AverageLatitudeLongitude < -16.82]
selected_lines

In [None]:
# selecting rows that satisfy either of two conditions (| means "or")
selected_lines2 = df[(df["Postcode"] == 10312.0) | (df["Community Board"] > 200)]
selected_lines2
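
Conditions can also be combined with `&` (and), negated with `~`, or expressed with `isin` for membership tests. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` or `>`. The frame below is invented for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "Borough": ["Bronx", "Queens", "Bronx", "Brooklyn"],
    "Postcode": [10451, 11101, 10452, 11201],
})

# both conditions must hold
both = sample[(sample["Borough"] == "Bronx") & (sample["Postcode"] > 10451)]

# membership test against a list of values
either = sample[sample["Borough"].isin(["Queens", "Brooklyn"])]

# ~ inverts a boolean mask
negated = sample[~(sample["Borough"] == "Bronx")]

print(both)
print(either)
print(negated)
```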