Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
This summary focuses on data analysis in Python: using the pandas library, indexing, slicing, and other data analysis techniques. It also includes a tutorial on how to read data from a local file, from a database, and from an API call.
In addition to this summary, you can access the materials and modules on Python for data analysis made by Mr. Samuel Chan in this GitHub repository. You can download all the files via the Code button -> Download ZIP. Hopefully this additional material will be helpful for learning Python.
A variable is a name that refers to a value. A variable is assigned using the `=` operator.
activity = "swimming"
In the code above, the variable `activity` is assigned the value `"swimming"`. Note that variable names are case sensitive, which means that if we try to call the variable `Activity`, it will not work.
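A quick illustration of case sensitivity, continuing the `activity` example above:

```python
activity = "swimming"

print(activity)    # prints: swimming
# print(Activity)  # NameError: name 'Activity' is not defined (different capitalization)
```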
One thing that must also be noted is that `=` and `==` are different things in Python: `=` is used to assign a value to a variable, while `==` is used to compare two values. For a more complete explanation of Python operators, you can refer to the Data Operator section.
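A short sketch of the difference:

```python
x = 5          # assignment: the name x now refers to the value 5
print(x == 5)  # comparison: prints True, because x equals 5
print(x == 3)  # comparison: prints False
```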
Some commonly used data types are:

- `object` : store text values
- `int` : integer values
- `datetime` : date and time values
- `bool` : True/False values
- `float` : floating point values
In Python, the data type is important because it determines the behavior of a variable. For example, if we assign a variable an integer value, we can use that variable to perform mathematical operations. There are also functions that can only be applied to specific data types, such as the `.describe()` function for a dataframe, which only calculates statistics for numeric columns.
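As a small sketch of how the data type affects behavior (the variable names and values below are made up for illustration):

```python
age = 25            # int
name = "Andi"       # str (shown as object in a pandas column)
height = 172.5      # float
is_member = True    # bool

print(type(age))        # <class 'int'>
print(age + 5)          # 30 -> arithmetic works on numeric types
print(name + " Putra")  # "Andi Putra" -> + on strings means concatenation, not addition
```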
Python starts its indexes at 0. So, in Python we start counting at 0, not at 1. For example, if we have a list of 5 elements, the first element is at index 0, the second at index 1, and so on.
x = [1, 2, 3, 4, 5]
So, from the list `x` above, we can access the first element with `x[0]`, where the value of `x[0]` is 1, the value of `x[1]` is 2, and so on.
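A short sketch of indexing and slicing on the list above:

```python
x = [1, 2, 3, 4, 5]

print(x[0])    # 1 -> first element
print(x[4])    # 5 -> fifth element (index starts at 0)
print(x[-1])   # 5 -> negative index counts from the end
print(x[1:3])  # [2, 3] -> slicing includes the start index but excludes the end index
```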
Arithmetic Operator
Arithmetic operators are used to performing mathematical operations like addition, subtraction, multiplication, and division.
Operator | Description | Syntax |
---|---|---|
+ | Addition: adds two operands | x + y |
- | Subtraction: subtracts two operands | x - y |
* | Multiplication: multiplies two operands | x * y |
/ | Division (float): divides the first operand by the second | x / y |
// | Division (floor): divides the first operand by the second | x // y |
% | Modulus: returns the remainder when the first operand is divided by the second | x % y |
** | Power: returns the first operand raised to the power of the second | x ** y |
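A quick sketch of these operators in use (with example values `x = 7` and `y = 3`):

```python
x, y = 7, 3

print(x + y)   # 10
print(x - y)   # 4
print(x * y)   # 21
print(x / y)   # 2.3333333333333335 (float division)
print(x // y)  # 2 (floor division)
print(x % y)   # 1 (remainder)
print(x ** y)  # 343 (7 to the power of 3)
```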
Comparison Operator
Comparison (relational) operators compare values. They return either True or False according to the condition.
Operator | Description | Syntax |
---|---|---|
> | Greater than: True if the left operand is greater than the right | x > y |
< | Less than: True if the left operand is less than the right | x < y |
== | Equal to: True if both operands are equal | x == y |
!= | Not equal to: True if operands are not equal | x != y |
>= | Greater than or equal to: True if the left operand is greater than or equal to the right | x >= y |
<= | Less than or equal to: True if the left operand is less than or equal to the right | x <= y |
is | Identity: True if x and y refer to the same object | x is y |
is not | Identity: True if x and y do not refer to the same object | x is not y |
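A quick sketch with example values `x = 5` and `y = 8`:

```python
x, y = 5, 8

print(x > y)   # False
print(x < y)   # True
print(x == y)  # False
print(x != y)  # True
print(x >= 5)  # True
print(x <= 4)  # False
print(x is y)  # False -> identity check: x and y are not the same object
```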
Logical Operator
Logical operators perform logical AND, logical OR, and logical NOT operations. They are used to combine conditional statements.
Operator | Description | Syntax |
---|---|---|
and (&) | Logical AND: True if both operands are true | x and y |
or (\|) | Logical OR: True if either of the operands is true | x or y |
not (~) | Logical NOT: True if the operand is false | not x |
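A quick sketch:

```python
x, y = True, False

print(x and y)  # False -> both operands must be True
print(x or y)   # True  -> at least one operand is True
print(not x)    # False

# In pandas, the element-wise counterparts &, |, and ~ are used instead
# to combine boolean conditions on Series, e.g. (df['A'] > 0) & (df['B'] < 5).
```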
Python is very powerful because there are many libraries and packages that support the language and help Python users perform specific tasks. Some of the most popular libraries and packages are: pandas, numpy, matplotlib, scipy, scikit-learn, and seaborn.
- pandas : specifically for data analysis and data manipulation
- numpy : for numerical computing and array manipulation
- matplotlib : for data visualization
- seaborn : for data visualization
- scipy : for data analysis (statistics)
- scikit-learn : for machine learning
In the next chapter, we will talk specifically about the pandas library for data analysis and manipulation in Python.
The pandas library is a popular package used for data analysis and manipulation. It provides a way for Python users to easily load tabular data and perform data analysis, and it is both easy to use and very powerful.
One of its most commonly used functions is `pd.DataFrame()`, which is used to create a dataframe. A dataframe is a two-dimensional structure indexed by rows and columns; in other words, tabular data like we usually work with in Microsoft Excel.
In the following section, we'll take a closer look at some of the most common slicing, subsetting, and other operations in pandas (a combined example tying several of these operations together is shown after the list):
- `head()` and `tail()` : The `head()` and `tail()` functions return the first or last n rows of the dataframe: `df.head()` or `df.tail()`. By default, if the `n=` parameter is not specified, `head()` returns the first 5 rows and `tail()` returns the last 5 rows.
- `select_dtypes()` : The `select_dtypes()` function returns a subset of the dataframe with only the columns that have the specified data types: `df.select_dtypes(include=['int64'])`. This returns a subset of the dataframe with only the columns that have integer data types.
- Using `.drop()` : The `.drop()` function is used to drop rows or columns from the dataframe: `df.drop(['column_name'], axis=1)`. This drops the column `column_name` from the dataframe.
- The `[]` operator : The `[]` operator is used to access the rows or columns of the dataframe: `df[['a', 'b']]`. This returns a subset of the dataframe with only the columns `a` and `b`.
- `.loc` : The `.loc` accessor is used to select rows or columns from the dataframe. Using `.loc`, we can select rows or columns by their index or column labels: `df.loc[0:2, ['A','C']]`. This returns a subset of the dataframe with only the rows that have index labels 0, 1, and 2 and the columns named `A` and `C`.
- `.iloc` : The `.iloc` accessor is used to select rows or columns from the dataframe. Using `.iloc`, we can select rows or columns only by their integer position: `df.iloc[0:2, 0:3]`. This returns a subset of the dataframe with only the rows at positions 0 and 1 and the columns at positions 0, 1, and 2 (with `.iloc`, the end of a slice is exclusive). The difference between `.loc` and `.iloc` is that `.loc` selects rows or columns by their labels, while `.iloc` selects them only by their integer position.
- Conditional subsetting : Conditional subsetting is used to select rows from the dataframe based on some condition: `df[df['A'] > 0]`. This returns a subset of the dataframe with only the rows whose values in column `A` are greater than 0.
- `.describe()` : The `.describe()` function is used to summarize the dataframe: `df.describe()`. It returns a summary of the dataframe, especially for the numeric columns, showing the count, mean, standard deviation, minimum, maximum, and percentiles of each numeric column.
- `.info()` : The `.info()` function is used to show information about the dataframe: `df.info()`. It shows the number of rows and columns, the data types, and the memory usage of the dataframe.
- `.columns` : The `.columns` attribute is used to show the column names of the dataframe: `df.columns`. It returns the list of column names of the dataframe.
- `.shape` : The `.shape` attribute is used to show the shape of the dataframe: `df.shape`. It returns a tuple with the number of rows and columns of the dataframe.
- `.value_counts()` : The `.value_counts()` function is used to show the value counts of a column: `df["A"].value_counts()`. It returns a series with the count of each unique value, showing how many times each unique value appears in the column.
- `.sort_values()` : The `.sort_values()` function is used to sort the dataframe: `df.sort_values(by=['A'], ascending=False)`. It returns the dataframe sorted by the values in column `A`, in descending order.
- `.groupby()` : The `.groupby()` function is used to group the dataframe by some column: `df.groupby(['A'])`. It returns a grouped dataframe based on the values in column `A`.
- `pd.crosstab()` : The `pd.crosstab()` function is used to show the cross tabulation of two columns: `pd.crosstab(df['A'], df['B'])`. It returns a cross tabulation (frequency table) of the values in columns `A` and `B`.
- `pd.merge()` : The `pd.merge()` function is used to merge two dataframes into one dataframe (usually also known as a `join`): `pd.merge(df1, df2, on='A')`. It returns a merged dataframe based on the values in the column `A`.
- `pd.to_datetime()` : The `pd.to_datetime()` function is used to convert a column to datetime format: `pd.to_datetime(df['A'])`. It returns the values in column `A` converted to datetime format.
- `dt.date` : The `dt.date` accessor is used to convert a datetime column to date format: `df["date"].dt.date`. It returns the date part of the values in the `date` column.
- `dt.hour` : The `dt.hour` accessor is used to show the hour of a datetime column: `df["date"].dt.hour`. It returns the hour of each value in the `date` column.
- `dt.day_name()` : The `dt.day_name()` function is used to show the day name of a datetime column: `df["date"].dt.day_name()`. It returns the day name (e.g. Monday) of each value in the `date` column.
- `pd.Series.str.split()` : The `str.split()` function is used to split the strings in a column: `df["A"].str.split(",")`. It returns the values in column `A` split by the comma into lists.
- `pd.Series.str.contains()` : The `str.contains()` function is used to check whether the strings in a column contain a pattern: `df["A"].str.contains("a")`. It returns a boolean series for the values in column `A`: True if the string contains the letter `a`, otherwise False.
- `pd.Series.str.replace()` : The `str.replace()` function is used to replace parts of the strings in a column: `df["A"].str.replace("a", "b")`. It returns the values in column `A` with every occurrence of the letter `a` replaced by the letter `b`.
- `pd.Series.str.len()` : The `str.len()` function is used to show the length of the strings in a column: `df["A"].str.len()`. It returns a series with the length of each value in column `A`.
- `pd.Series.str.lower()` : The `str.lower()` function is used to convert the strings in a column to lower case: `df["A"].str.lower()`. It returns a series with the lower-case version of each value in column `A`.
- `pd.Series.str.upper()` : The `str.upper()` function is used to convert the strings in a column to upper case: `df["A"].str.upper()`. It returns a series with the upper-case version of each value in column `A`.
- `pd.Series.str.startswith()` : The `str.startswith()` function is used to check whether the strings in a column start with a given pattern: `df["A"].str.startswith("a")`. It returns a boolean series for the values in column `A`: True if the string starts with the letter `a`, otherwise False.
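To tie several of the operations above together, here is a minimal sketch using a small made-up dataframe (the column names and values are purely illustrative):

```python
import pandas as pd

# Small illustrative dataframe (hypothetical data)
df = pd.DataFrame({
    "name":  ["ana", "budi", "citra", "budi"],
    "score": [80, 65, 90, 70],
    "date":  pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-03", "2023-01-04"]),
})

print(df.head())                        # first rows
df.info()                               # column types and memory usage
print(df.shape)                         # (4, 3)

print(df[["name", "score"]])            # column subsetting with []
print(df.loc[0:2, ["name", "score"]])   # label-based selection
print(df.iloc[0:2, 0:2])                # position-based selection
print(df[df["score"] > 70])             # conditional subsetting

print(df["name"].value_counts())        # count of each unique name
print(df.sort_values(by=["score"], ascending=False))
print(df.groupby(["name"])["score"].mean())

print(df["date"].dt.day_name())         # day name of each date
print(df["name"].str.upper())           # upper-case names
```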
There are several ways to read data in Python. One of the most common is reading data from a local file using the pandas library.
df = pd.read_csv("data.csv")
The function returns a dataframe with the data from the file `data.csv`.
There are also several other data formats that can be read using the pandas library, with functions such as: pd.read_excel(), pd.read_json(), pd.read_html(), pd.read_sql(), pd.read_sql_query(), pd.read_hdf(), pd.read_pickle(), pd.read_sas(), pd.read_stata(), pd.read_table(), pd.read_clipboard(), pd.read_fwf(), and pd.read_gbq().
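As a sketch, these readers follow the same pattern as `pd.read_csv()` (the file names below are hypothetical):

```python
import pandas as pd

df_xlsx = pd.read_excel("data.xlsx", sheet_name=0)  # read the first sheet of an Excel file
df_json = pd.read_json("data.json")                 # read a JSON file into a dataframe
```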
Another way to import data is from a database. To read data from a database, we must know which database we are using. The explanation below shows how to read data from several common databases.
- SQLite : To import data from an SQLite file, we need to import the sqlite3 and pandas libraries. The code below shows how to import data from an SQLite file.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("path_to_file")  # fill "path_to_file" with the path to the file
albums = pd.read_sql_query("SELECT * FROM table_name", conn)
```
- MySQL : To import data from MySQL, we need to import the pymysql and pandas libraries. The code below shows how to import data from MySQL.

```python
import pymysql
import pandas as pd

conn = pymysql.connect(host=host, port=port, user=user, password=password, db=database)
sales = pd.read_sql_query("SELECT * FROM table_name", conn)
```
- PostgreSQL : To import data from PostgreSQL, we need to import the sqlalchemy and pandas libraries. The code below shows how to import data from PostgreSQL.

```python
import sqlalchemy
import pandas as pd

db_address = f'postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}'
engine = sqlalchemy.create_engine(db_address)
sales = pd.read_sql_query("SELECT * FROM table_name", engine)
```
To call data from an API, we need to import the `requests`, `pandas`, and `json` libraries. The code below shows how to call data from an API.
import requests
import json
import pandas as pd
req = requests.get(url) # fill "url" with the URL of the API
req.status_code
If the request is successful, the status code will be 200; otherwise it will be an error code such as 400 or 404. If the status code is 200, we can use the `json.loads()` function to convert the response text to a Python dictionary (or list) and the `pd.DataFrame()` function to convert it to a dataframe.
text = json.loads(req.text)
face = pd.DataFrame(text)
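Putting it together, a minimal sketch with a status-code check (the URL below is a hypothetical placeholder for a real API endpoint):

```python
import requests
import json
import pandas as pd

url = "https://example.com/api/items"  # hypothetical endpoint
req = requests.get(url)

if req.status_code == 200:
    records = json.loads(req.text)     # parse the JSON response body
    df = pd.DataFrame(records)         # works when the response is a list of records
    print(df.head())
else:
    print("Request failed with status code:", req.status_code)
```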