
<h1>ASSIGNMENT WORK</h1>

# CHAPTER 1: Preliminaries
This book is concerned with the nuts and bolts of manipulating, processing, cleaning,
and crunching data in Python. My goal is to offer a guide to the parts of the Python
programming language and its data-oriented library ecosystem and tools that will
equip you to become an effective data analyst. While “data analysis” is in the title
of the book, the focus is specifically on Python programming, libraries, and tools as
opposed to data analysis methodology. This is the Python programming you need for
data analysis.

Sometime after I originally published this book in 2012, people started using the
term data science as an umbrella description for everything from simple descriptive
statistics to more advanced statistical analysis and machine learning. The Python
open source ecosystem for doing data analysis (or data science) has also expanded
significantly since then. There are now many other books which focus specifically on
these more advanced methodologies. My hope is that this book serves as adequate
preparation to enable you to move on to a more domain-specific resource.

# CHAPTER 2: Python Language Basics, IPython, and Jupyter Notebooks
When I wrote the first edition of this book in 2011 and 2012, there were fewer
resources available for learning about doing data analysis in Python. This was
partially a chicken-and-egg problem; many libraries that we now take for granted,
like pandas, scikit-learn, and statsmodels, were comparatively immature back then.
In 2022, there is a growing literature on data science, data analysis, and
machine learning, supplementing the prior works on general-purpose scientific
computing geared toward computational scientists, physicists, and professionals in other
research fields. There are also excellent books about learning the Python programming
language itself and becoming an effective software engineer.

As this book is intended as an introductory text in working with data in Python, I
feel it is valuable to have a self-contained overview of some of the most important
features of Python’s built-in data structures and libraries from the perspective of data
manipulation. So, I will only present roughly enough information in this chapter and
Chapter 3 to enable you to follow along with the rest of the book.

Much of this book focuses on table-based analytics and data preparation tools for
working with datasets that are small enough to fit on your personal computer. To
use these tools you must sometimes do some wrangling to arrange messy data into
a more nicely tabular (or structured) form. Fortunately, Python is an ideal language
for doing this. The greater your facility with the Python language and its built-in data
types, the easier it will be for you to prepare new datasets for analysis.

Some of the tools in this book are best explored from a live IPython or Jupyter
session. Once you learn how to start up IPython and Jupyter, I recommend that you
follow along with the examples so you can experiment and try different things. As with any keyboard-driven console-like environment, developing familiarity with the
common commands is also part of the learning curve.

# CHAPTER 3: Built-In Data Structures, Functions, and Files
This chapter discusses capabilities built into the Python language that will be used
ubiquitously throughout the book. While add-on libraries like pandas and NumPy
add advanced computational functionality for larger datasets, they are designed to be
used together with Python’s built-in data manipulation tools.

We’ll start with Python’s workhorse data structures: tuples, lists, dictionaries, and sets.
Then, we’ll discuss creating your own reusable Python functions. Finally, we’ll look at
the mechanics of Python file objects and interacting with your local hard drive.
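
As a rough illustration of the topics listed above, the following sketch touches each built-in structure once, defines a small function, and round-trips some values through a file. The names `point`, `values`, `counts`, `squared_norm`, and the temporary file `values.txt` are made up for this example and do not come from the book.

```python
import os
import tempfile

# Tuple: a fixed-length, immutable sequence
point = (3, 4)

# List: a variable-length, mutable sequence
values = [5, 2, 9]
values.append(2)

# Dictionary: a key-value mapping, used here to count occurrences
counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1

# Set: an unordered collection of unique elements
unique_values = set(values)


def squared_norm(pair):
    """Return the sum of squares of a two-element sequence."""
    x, y = pair  # tuple unpacking
    return x ** 2 + y ** 2


# File objects: write one value per line to a temporary file, then read back
path = os.path.join(tempfile.mkdtemp(), "values.txt")
with open(path, "w", encoding="utf-8") as f:
    for v in values:
        f.write(f"{v}\n")

with open(path, encoding="utf-8") as f:
    loaded = [int(line) for line in f]
```

The `with` statement closes each file automatically when the block exits, which is usually preferable to calling `close()` by hand.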

# CHAPTER 4: NumPy Basics: Arrays and Vectorized Computation
NumPy, short for Numerical Python, is one of the most important foundational packages
for numerical computing in Python. Many computational packages providing
scientific functionality use NumPy’s array objects as one of the standard interface
lingua francas for data exchange. Much of the knowledge about NumPy that I cover is
transferable to pandas as well.

Here are some of the things you’ll find in NumPy:

*  ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
*  Mathematical functions for fast operations on entire arrays of data without having to write loops
*  Tools for reading/writing array data to disk and working with memory-mapped files
*  Linear algebra, random number generation, and Fourier transform capabilities
*  A C API for connecting NumPy with libraries written in C, C++, or FORTRAN
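
A few of the features in the list above can be sketched in a handful of lines. This is only a minimal illustration on a made-up 2 x 3 array; the variable names are assumptions for the example, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# A small two-dimensional ndarray (illustrative values)
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Array-oriented arithmetic: the operation applies to every element, no loop
doubled = data * 2

# Broadcasting: the shape-(3,) column means are subtracted from each row
col_means = data.mean(axis=0)
centered = data - col_means

# Mathematical functions that operate on whole arrays
roots = np.sqrt(data)

# Linear algebra: matrix product of the array with its transpose
gram = data @ data.T

# Random number generation with a seeded generator
rng = np.random.default_rng(seed=0)
noise = rng.standard_normal(3)

# Writing and reading array data (a BytesIO buffer is used instead of a file)
buffer = io.BytesIO()
np.save(buffer, data)
buffer.seek(0)
restored = np.load(buffer)
```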

# CHAPTER 5: Getting Started with pandas
pandas will be a major tool of interest throughout much of the rest of the book. It
contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and convenient in Python. pandas is often used in tandem with
numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels
and scikit-learn, and data visualization libraries like matplotlib. pandas adopts
significant parts of NumPy’s idiomatic style of array-based computing, especially
array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference
is that pandas is designed for working with tabular or heterogeneous data. NumPy, by
contrast, is best suited for working with homogeneously typed numerical array data.
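
The contrast can be seen in a small sketch: a NumPy array carries a single dtype for all of its elements, while each column of a pandas DataFrame keeps its own. The column names and values here are placeholders invented for the example.

```python
import numpy as np
import pandas as pd

# NumPy: one dtype shared by every element of the array
arr = np.array([[1.5, 2.0], [3.5, 4.0]])
arr_dtype = arr.dtype  # a single floating-point dtype

# pandas: each column of a DataFrame has its own dtype
frame = pd.DataFrame({
    "name": ["a", "b"],     # text
    "count": [1, 2],        # integers
    "score": [0.5, 1.5],    # floating point
})
column_dtypes = frame.dtypes

# The NumPy-style idiom of avoiding explicit for loops carries over
frame["double_score"] = frame["score"] * 2
```

Inspecting `column_dtypes` shows one entry per column, which is the sense in which a DataFrame holds heterogeneous data.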

# CHAPTER 6: Data Loading, Storage, and File Formats
Reading data and making it accessible (often called data loading) is a necessary first
step for using most of the tools in this book. The term parsing is also sometimes used
to describe loading text data and interpreting it as tables and different data types. I’m
going to focus on data input and output using pandas, though there are numerous
tools in other libraries to help with reading and writing data in various formats.

Input and output typically fall into a few main categories: reading text files and other
more efficient on-disk formats, loading data from databases, and interacting with
network sources like web APIs.
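
For the text-file category, a minimal sketch with `pandas.read_csv` and `DataFrame.to_csv` might look like the following. The CSV content is invented for the example, and `io.StringIO` buffers stand in for real files so the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV text; the last row has an empty "value" field
raw_text = "key,value\na,1\nb,2\nc,\n"

# Parsing: the text is interpreted as a table, and the empty field
# becomes a missing value (NaN)
frame = pd.read_csv(io.StringIO(raw_text))

# Writing the table back out as CSV text, without the row index
out = io.StringIO()
frame.to_csv(out, index=False)
csv_text = out.getvalue()
```

Passing a file path instead of a buffer works the same way for both calls.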

# CHAPTER 7: Data Cleaning and Preparation
During the course of doing data analysis and modeling, a significant amount of time
is spent on data preparation: loading, cleaning, transforming, and rearranging. Such
tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the
way that data is stored in files or databases is not in the right format for a particular
task. Many researchers choose to do ad hoc processing of data from one form to
another using a general-purpose programming language, like Python, Perl, R, or Java,
or Unix text-processing tools like sed or awk. Fortunately, pandas, along with the
built-in Python language features, provides you with a high-level, flexible, and fast set
of tools to enable you to manipulate data into the right form.

If you identify a type of data manipulation that isn’t anywhere in this book or
elsewhere in the pandas library, feel free to share your use case on one of the
Python mailing lists or on the pandas GitHub site. Indeed, much of the design and
implementation of pandas have been driven by the needs of real-world applications.

In this chapter I discuss tools for missing data, duplicate data, string manipulation,
and some other analytical data transformations. In the next chapter, I focus on
combining and rearranging datasets in various ways.
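
To give a feel for the kinds of tools covered, here is a small sketch that removes duplicate rows, handles missing values, and normalizes strings. The DataFrame contents are invented, and filling missing scores with the column mean is just one possible choice made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: stray whitespace, mixed case, a repeated row,
# a missing score, and a missing name
raw = pd.DataFrame({
    "name": [" alice", "BOB ", "BOB ", "carol", None],
    "score": [1.0, 2.0, 2.0, np.nan, 4.0],
})

# Duplicate data: drop rows that repeat an earlier row exactly
deduped = raw.drop_duplicates()

# Missing data: drop rows with no name, then fill missing scores
cleaned = deduped.dropna(subset=["name"]).copy()
cleaned["score"] = cleaned["score"].fillna(cleaned["score"].mean())

# String manipulation: vectorized methods under the .str accessor
cleaned["name"] = cleaned["name"].str.strip().str.lower()
```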

# CHAPTER 8: Data Wrangling: Join, Combine, and Reshape
In many applications, data may be spread across a number of files or databases, or be
arranged in a form that is not convenient to analyze. This chapter focuses on tools to
help combine, join, and rearrange data.

First, I introduce the concept of hierarchical indexing in pandas, which is used extensively
in some of these operations. I then dig into the particular data manipulations.
You can see various applied usages of these tools in Chapter 13.
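
As a brief preview, the sketch below builds a Series with a two-level index, reshapes it with `unstack` and `stack`, and joins two small tables with `pandas.merge`. All data and names (`outer`, `inner`, `key`, `x`, `y`) are placeholders invented for this example.

```python
import pandas as pd

# Hierarchical indexing: a Series with two index levels
index = pd.MultiIndex.from_product(
    [["a", "b"], ["one", "two"]], names=["outer", "inner"]
)
series = pd.Series([1, 2, 3, 4], index=index)

# Selecting on the outer level returns the inner-level slice
slice_a = series.loc["a"]

# Reshape: move the inner level into columns, then back into the index
wide = series.unstack("inner")
long_again = wide.stack()

# Join: combine two tables on a shared key column
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})
merged = pd.merge(left, right, on="key", how="inner")
```

An inner join keeps only the keys present in both tables; other values of `how` such as `"left"` or `"outer"` keep unmatched rows and fill the gaps with missing values.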

# CHAPTER 9: Plotting and Visualization
Making informative visualizations (sometimes called plots) is one of the most important
tasks in data analysis. It may be a part of the exploratory process, for example,
to help identify outliers or needed data transformations, or as a way of generating
ideas for models. For others, building an interactive visualization for the web may
be the end goal. Python has many add-on libraries for making static or dynamic
visualizations, but I’ll be mainly focused on matplotlib and libraries that build on top
of it.

matplotlib is a desktop plotting package designed for creating plots and figures
suitable for publication. The project was started by John Hunter in 2002 to enable
a MATLAB-like plotting interface in Python. The matplotlib and IPython communities
have collaborated to simplify interactive plotting from the IPython shell (and
now, Jupyter notebook). matplotlib supports various GUI backends on all operating
systems and can export visualizations to all of the common vector and raster graphics
formats (PDF, SVG, JPG, PNG, BMP, GIF, etc.). With the exception of a few diagrams,
nearly all of the graphics in this book were produced using matplotlib.
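
The export capability can be sketched as follows: one figure is drawn and then saved in several formats. The plotted data is arbitrary, and the non-interactive `Agg` backend and the in-memory buffers are choices made here so the example runs without a display or files on disk.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no GUI window is opened
import matplotlib.pyplot as plt

# Draw a simple line plot on a single set of axes
fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], label="squares")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Save the same figure as raster (PNG) and vector (SVG, PDF) output
exports = {}
for fmt in ("png", "svg", "pdf"):
    buffer = io.BytesIO()
    fig.savefig(buffer, format=fmt)
    exports[fmt] = buffer.getvalue()

plt.close(fig)
```

In a notebook or script, `fig.savefig("figure.png")` with a file path infers the format from the extension.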

Over time, matplotlib has spawned a number of add-on toolkits for data visualization
that use matplotlib for their underlying plotting. One of these is seaborn, which we
explore later in this chapter.

# CHAPTER 10: Data Aggregation and Group Operations
Categorizing a dataset and applying a function to each group, whether an aggregation
or transformation, can be a critical component of a data analysis workflow. After
loading, merging, and preparing a dataset, you may need to compute group statistics
or possibly pivot tables for reporting or visualization purposes. pandas provides a
versatile groupby interface, enabling you to slice, dice, and summarize datasets in a
natural way.

One reason for the popularity of relational databases and SQL (which stands for
“structured query language”) is the ease with which data can be joined, filtered,
transformed, and aggregated. However, query languages like SQL impose certain
limitations on the kinds of group operations that can be performed. As you will see,
with the expressiveness of Python and pandas, we can perform quite complex group
operations by expressing them as custom Python functions that manipulate the data
associated with each group. In this chapter, you will learn how to:

*  Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
*  Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
*  Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
*  Compute pivot tables and cross-tabulations
*  Compute pivot tables and cross-tabulations
*  Perform quantile analysis and other statistical group analyses
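
Several of the operations in the list above can be previewed in a short sketch. The DataFrame and the helper `value_range` are invented for illustration, and splitting the values into two quantile buckets is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical data: a grouping key and a numeric value
frame = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split by a column name, then compute summary statistics per group
grouped = frame.groupby("key")["value"]
summary = grouped.agg(["count", "mean", "std"])


def value_range(group):
    """User-defined aggregation: spread between largest and smallest value."""
    return group.max() - group.min()


ranges = grouped.agg(value_range)

# Within-group transformation: subtract each group's mean from its members
demeaned = frame["value"] - grouped.transform("mean")

# Quantile analysis: cut values into two equal-sized buckets,
# then cross-tabulate group key against bucket
frame["bucket"] = pd.qcut(frame["value"], 2, labels=["low", "high"]).astype(str)
table = pd.crosstab(frame["key"], frame["bucket"])
```

`agg` returns one row per group, while `transform` returns a result aligned with the original rows, which is why `demeaned` has the same length as `frame`.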

# CHAPTER 11