# Pandas

<img src="PANDAS.png" width=1500 height=500 />


For reference, see the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).

# Data analysis with python - **Pandas**

- Pandas is a Python library that provides data structures and data analysis tools for handling and manipulating numerical tables and time series data.
-  It is built on top of NumPy, the popular numerical computing library, and is widely used for data preparation and wrangling tasks in data science and machine learning workflows. 
- Some of the key features of pandas include its fast and efficient handling of large datasets, powerful data manipulation and cleaning capabilities, and support for a wide range of file formats and data sources.

## Main Features of PANDAS

The main features of the Pandas library are:

- **Easy handling of missing data:** Missing data (represented as `NaN`, `NA`, or `NaT`) is handled in floating point as well as non-floating point data 
- **Size mutability:** columns can be inserted and deleted from DataFrame and higher dimensional objects
- **Automatic and explicit data alignment:** objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let `Series`, `DataFrame`, etc. automatically align the data for you in computations
- **Groupby:** Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- **Data conversion:** Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- **Data manipulation:** 
    - Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
    - Intuitive merging and joining data sets
    - Flexible reshaping and pivoting of data sets
    - Hierarchical labeling of axes (possible to have multiple labels per tick)
- **Type of data handling:** Robust IO tools for loading data from flat files (`CSV` and `delimited`), Excel files, databases, and saving/loading data from the ultrafast `HDF5` format
- **Time series-specific functionality:** date range generation and frequency conversion, moving window statistics, date shifting and lagging.
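
Three of the features listed above (missing-data handling, automatic alignment, and group-by) can be illustrated with a minimal sketch. The values and variable names below are made up for illustration and do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Missing data: NaN is skipped by default in reductions such as sum().
s = pd.Series([1.0, np.nan, 3.0])
total = s.sum()              # NaN is ignored
filled = s.fillna(0.0)       # replace NaN with 0.0

# Automatic alignment: values are matched by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
aligned = a + b              # labels present in only one Series give NaN

# Group-by: split rows by key, apply an aggregation, combine the results.
df = pd.DataFrame({"team": ["A", "A", "B"], "points": [10, 20, 5]})
per_team = df.groupby("team")["points"].sum()

print(total)
print(aligned)
print(per_team)
```

Only the labels `y` and `z` exist in both `a` and `b`, so only those two positions of `aligned` hold numbers.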



![Python](https://img.shields.io/badge/python-3670A0?style=flat&logo=python&logoColor=ffdd54) ![Anaconda](https://img.shields.io/badge/Anaconda-%2344A833.svg?style=flat&logo=anaconda&logoColor=white) 
![NumPy](https://img.shields.io/badge/numpy-%23013243.svg?style=flat&logo=numpy&logoColor=white) ![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=flat&logo=pandas&logoColor=white)

## NumPy vs Pandas

<table><tr>
<td> <img src="Numpy-1.png" alt="Drawing" style="width: 650px;"/> </td>
<td> <img src="Pandas-1.png" alt="Drawing" style="width: 650px;"/> </td>
</tr></table>

[Image reference](https://favtutor.com/blogs/numpy-vs-pandas)

| **Comparison Parameter** |  **NumPy** | **Pandas** |
|----------------------|--------|--------|
| **Powerful Tool** | _A powerful tool of NumPy is Arrays_ | _A powerful tool of Pandas is Data frames and a Series_ |
| **Memory Consumption** | _NumPy is memory efficient_ | _Pandas consume more memory_ |
| **Data Compatibility** | _Works with numerical data_ | _Works with tabular data_ |
| **Performance** | _Better performance when the number of rows is 50K or less_ | _Better performance when the number of rows is 500k or more_ |
| **Speed** | _Faster than data frames_ | _Relatively slower than arrays_ |
| **Data Object** | _Creates “N” dimensional objects_ | _Creates 1D (`Series`) and 2D (`DataFrame`) objects_ |
| **Type of Data** | _Homogenous data type_ | _Heterogenous data type_ |
| **Access Methods** | _Using only index position_ | _Using index position or index labels_ |
| **Indexing** | _Indexing in NumPy arrays is very fast_ | _Indexing in Pandas series is slower by comparison_ |
| **Operations** | _Has no built-in group-by or label-based utilities_ | _Provides special utilities such as “groupby” to access and manipulate subsets_ |
| **External Data** | _Generally used data created by the user or built-in function_ | _Pandas object created by external data such as CSV, Excel, or SQL_ |
| **Application** | _NumPy is popular for numerical calculations_ | _Pandas is popular for data analysis and visualizations_ |
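| **Usage in ML and AI** | _Toolkits like TensorFlow and scikit-learn operate on NumPy arrays internally_ | _Pandas objects are usually converted to NumPy arrays (for example with `to_numpy()`) before or while being passed to such toolkits_ |
| **Core Language** | _NumPy's core is written in C_ | _Pandas is written in Python, with performance-critical parts in Cython and C; its DataFrame concept resembles the data frame of the R language_ |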


Python libraries like NumPy and Pandas are often used together for data manipulations and numerical operations.
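
As a small sketch of how the two libraries work together, the example below wraps a NumPy array in a DataFrame, applies a NumPy function to a column, and converts back to an array. The column and row labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Start from a plain 2-D NumPy array (homogeneous dtype, positional access only).
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Wrap it in a DataFrame to gain row and column labels.
df = pd.DataFrame(arr, columns=["height", "width"], index=["r1", "r2", "r3"])

# NumPy functions accept pandas objects; the result here is still a labelled Series.
roots = np.sqrt(df["height"])

# Convert back to a NumPy array when a library expects one.
back = df.to_numpy()

print(df)
print(roots)
print(back.shape)
```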

For more details on the NumPy library, please see the NumPy notebook in my GitHub repository: [NumPy notebook on GitHub](https://github.com/arunsinp/Python-programming/blob/main/Python-fundamental/Numpy-tutorial.ipynb)

In [2]:
# To import Pandas library to your notebook
import pandas as pd
# NumPy is usually imported alongside pandas
import numpy as np

# Pandas Data Structures

Pandas supports two data structures:
- Series
- Dataframe

Both DataFrame and Series have powerful methods for handling missing data, performing data cleaning and wrangling, and handling time series data. They also provide built-in functions for statistical analysis and basic plotting, and they integrate well with machine learning libraries.

Let's look at the two data structures one by one.

1. **Series:** _A Series is a one-dimensional array-like object that can hold any data type. It is similar to a column in a DataFrame. Series can be created from various data types, including lists, numpy arrays, and dictionaries. Each element in a Series has a unique label, called the index, which can be used to access the elements in the Series_.

    Syntax: 
    
    `pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`

    Parameters:

    - `data`: array-like, Iterable, dict, or scalar value. Contains the data stored in the Series.
    - `index`: array-like or Index (1d)
    - `dtype`: str, numpy.dtype, or ExtensionDtype, optional
    - `name`: str, optional
    - `copy`: bool, default False

2. **Dataframe:** _A DataFrame is a two-dimensional size-mutable, tabular data structure with rows and columns. It is similar to a spreadsheet or a SQL table. DataFrames can be created from various sources including CSV files, Excel files, and SQL databases. They can also be created from lists, numpy arrays, and dictionaries. DataFrames can be manipulated using various methods such as indexing, slicing, and groupby._

    Syntax for creating a DataFrame:

    `pandas.DataFrame(data, index, columns)`

    where:

    * `data`: The data from which the DataFrame is created. It can be a list, dictionary, scalar value, Series, ndarray, etc.
    * `index`: Optional row labels. If omitted, the rows are labelled 0 to n-1, where n is the number of rows.
    * `columns`: Optional column labels. If omitted and the data carries no names of its own, the columns are labelled 0 to n-1, where n is the number of columns.


    Data in higher dimensions are supported within DataFrame using a concept called hierarchical indexing. 
    
    A Pandas DataFrame consists of three principal components: the data, the rows, and the columns. (An example DataFrame is shown below, [Reference](https://devopedia.org/pandas-data-structures)).

    <img src="https://devopedia.org/images/article/304/7205.1610253721.jpg" width=800 height=400/>

    Pandas DataFrame can be created in multiple ways. Let’s discuss different ways to create a DataFrame one by one.

    - Creating an empty dataframe
    - Creating a dataframe using List
    - Creating DataFrame from dict of ndarray/lists
    - Create pandas dataframe from lists using dictionary

    See examples below.
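
Before the cell-by-cell examples, the constructor parameters described above can be sketched in one place. This is a minimal illustration; the labels and values are hypothetical.

```python
import pandas as pd

# Series: explicit index labels, dtype, and name.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "pear", "plum"],
                   dtype="float64", name="price")

# DataFrame: explicit row labels (index) and column names (columns).
table = pd.DataFrame([[1, 2], [3, 4]], index=["row1", "row2"], columns=["a", "b"])

# Without index/columns, both default to 0..n-1.
default = pd.DataFrame([[1, 2], [3, 4]])

print(prices["pear"])
print(table.loc["row2", "b"])
print(list(default.columns))
```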

##### Example on series

In [16]:
# Series holding character (string) data.
# A simple list of characters (named `chars` to avoid shadowing the built-in `list`)
chars = ['P', 'A', 'N', 'D', 'A', 'S']

# Create a Series from the list
res = pd.Series(chars)
print(res)

0    P
1    A
2    N
3    D
4    A
5    S
dtype: object


In [12]:
# Series holding integer data.
# A simple list of integers (named `numbers` to avoid shadowing the built-in `list`)
numbers = [1, 2, 3, 4, 5]

# Create a Series from the list
res = pd.Series(numbers)
print(res)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [17]:
# Series created from a dictionary (the keys become the index).
dic = { 'Id': 1013, 
        'Name': 'Pandas',
        'State': 'Python',
        'Age': 2008}
 
res = pd.Series(dic)
print(res)

Id         1013
Name     Pandas
State    Python
Age        2008
dtype: object


##### Examples on dataframe

In [14]:
# Creating an empty dataframe
# Calling DataFrame constructor
df = pd.DataFrame()
 
print(df)

Empty DataFrame
Columns: []
Index: []


In [18]:
# Creating a dataframe using List
# list of strings
lst = ['Python', 'was', 'first', 'time', 'released', 'on', 'Jan 11, 2008']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

              0
0        Python
1           was
2         first
3          time
4      released
5            on
6  Jan 11, 2008


In [20]:
# Creating DataFrame from dict of ndarray/lists
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 
        'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18


In [21]:
# Creating a pandas DataFrame from a dictionary of lists
# (the variable is named `data` to avoid shadowing the built-in `dict`)
data = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score': [90, 40, 80, 98]}

df = pd.DataFrame(data)
 
print(df)


     name  degree  score
0  aparna     MBA     90
1  pankaj     BCA     40
2  sudhir  M.Tech     80
3   Geeku     MBA     98


In [22]:
# Creating a DataFrame by providing index labels explicitly.
# initialize data of lists.
data = {'Name': ['Tom', 'Jack', 'nick', 'juli'],
        'marks': [99, 98, 95, 90]}
  
# Creates pandas DataFrame.
df = pd.DataFrame(data, index=['rank1',
                               'rank2',
                               'rank3',
                               'rank4'])
  
# print the data
df

       Name  marks
rank1   Tom     99
rank2  Jack     98
rank3  nick     95
rank4  juli     90


In [24]:
# creating pandas DataFrame by passing lists of dictionaries and row indexes.
# Initialize data of lists
data = [{'b': 2,            'c': 3}, 
        {'a': 10, 'b': 20, 'c': 30}
        ]
  
# Creates pandas DataFrame by passing
# Lists of dictionaries and row index.
df = pd.DataFrame(data, index=['first', 'second'])
  
# Print the data
df

         b   c     a
first    2   3   NaN
second  20  30  10.0


In [26]:
# The example above can also be written as a dictionary of lists,
# using np.nan (not the string 'NaN') for the missing value:
data = {'a': [np.nan, 10.0],
        'b': [2, 20],
        'c': [3, 30]}

# Create the DataFrame from the dictionary of lists and the row index.
df = pd.DataFrame(data, index=['first', 'second'])

# Print the data
df

           a   b   c
first    NaN   2   3
second  10.0  20  30


In [29]:
# creating pandas DataFrame from lists of dictionaries with both row index as well as column index.
# Initialize lists data.
data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]
  
# With two column indices, values same
# as dictionary keys
df1 = pd.DataFrame(data, index=['first',
                                'second'],
                   columns=['a', 'b'])
  
# With two column indices with
# one index with other name
df2 = pd.DataFrame(data, index=['first',
                                'second'],
                   columns=['a', 'b1'])
  
# print for first data frame
print(df1, "\n")
  
# Print for second DataFrame.
print(df2)

        a   b
first   1   2
second  5  10 

        a  b1
first   1 NaN
second  5 NaN


# i. Pandas Series functions

Pandas Series is a one-dimensional array-like object that can hold any data type. It is similar to a column in a spreadsheet or a DataFrame.  

`class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`

- `data`:  Contains data stored in Series. If data is a dict, argument order is maintained.
- `index`: Will default to RangeIndex (0, 1, 2, …, n) if not provided.
- `dtype` (Optional): If not specified, this will be inferred from data.
- `name` (optional): The name to give to the Series.
- `copy`: bool, default False (Copy input data.)

**A few important functions**

| Functions | Description | 
|-----------|-------------|
| `Series.array` | returns the ExtensionArray that backs the Series (use `Series.to_numpy()` for a NumPy array). | 
| `Series.head(n)` | returns the first n rows of the Series. | 
| `Series.tail(n)` |  returns the last n rows of the Series. |
| `Series.unique()` | returns an array of unique values in the Series. |
| `Series.value_counts()` |  returns a Series containing the count of unique values. |
| `Series.nunique()` | returns number of unique elements in the Series. |
| `Series.sort_values(ascending=True)` | sorts the Series in ascending or descending order. |
| `Series.sort_index()` |  sorts the Series by its index. |
| `Series.mean()` | returns the mean of the Series. |
| `Series.median()` | returns the median of the Series. |
| `Series.sum()` | returns the sum of the Series.|
| `Series.min()` |  returns the minimum value in the Series. |
| `Series.max()` | returns the maximum value in the Series.|
| `Series.describe()` | returns a summary of statistics of the Series.|
| `Series.apply(function)` | calls a function on each value of the Series and returns the results.|
| `Series.map(arg)` | maps each value through a function, dict, or Series and returns a new Series. | 
| `Series.str.contains(string)` | returns a boolean mask indicating whether each element in the Series contains a given string. | 
| `Series.str.replace(old, new)` | replaces all occurrences of a string in the Series with another string. |
| `Series.astype(dtype)` | converts the data type of the Series to the specified data type.|
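
A few of the functions from the table can be tried on a small Series. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import pandas as pd

fruits = pd.Series(["apple", "banana", "apple", "cherry", "banana", "apple"])

counts = fruits.value_counts()      # occurrences of each distinct value
n_distinct = fruits.nunique()       # number of distinct values
distinct = fruits.unique()          # distinct values in order of first appearance

lengths = fruits.apply(len)         # call a function on every element
codes = fruits.map({"apple": 1, "banana": 2, "cherry": 3})   # look up each value in a dict
has_an = fruits.str.contains("an")  # boolean mask: does the string contain "an"?

nums = pd.Series([3, 1, 2])
sorted_nums = nums.sort_values(ascending=False)   # largest value first
as_float = nums.astype("float64")                 # change the dtype

print(counts)
print(codes.tolist())
print(sorted_nums.tolist())
```

Note that `map` also accepts a dict or another Series as a lookup table, whereas `apply` expects a callable.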

### Series attributes

| Function | Description | Importance |
|----------|-------------|---------------|
| `Series.index` | The index (axis labels) of the Series. | 👍 |
| `Series.array` | The ExtensionArray of the data backing this Series or Index. |  |
| `Series.values` | Return Series as ndarray or ndarray-like depending on the dtype. | 👍 |
| `Series.dtype` | Return the dtype object of the underlying data. | | 
| `Series.shape` | Return a tuple of the shape of the underlying data. | 👍 | 
| `Series.nbytes` | Return the number of bytes in the underlying data. | | 
| `Series.ndim` | Number of dimensions of the underlying data, by definition 1. |  👍 | 
| `Series.size` | Return the number of elements in the underlying data.| 👍 | 
| `Series.T` | Return the transpose, which is by definition self. | 👍  |
| `Series.memory_usage([index, deep])` | Return the memory usage of the Series. |  | 
| `Series.hasnans` | Return True if there are any NaNs. |  | 
| `Series.empty` | Indicator whether Series/DataFrame is empty. |  | 
| `Series.name` | Return the name of the Series. | 👍 | 
| `Series.flags` | Get the properties associated with this pandas object. |  | 
| `Series.set_flags(*[, copy, ...])` | Return a new object with updated flags. | |

In [144]:
# Example: on series data attributes
import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(6), index=['p', 'q', 'r', 'n', 't','v'], name= "My_series")

print("\n The newly created series is: \n", s)

print("\n The backing array of the series is:\n ", s.array) # .array returns the ExtensionArray backing the series (a PandasArray here), not a plain ndarray

print("\n The numpy array from the series is:\n ", s.values)
print("\n Transpose of the series: \n", s.T)


 The newly created series is: 
 p   -0.341835
q    0.001739
r   -0.319288
n    0.106556
t    0.506951
v   -0.522571
Name: My_series, dtype: float64

 The backing array of the series is:
  <PandasArray>
[  -0.3418352209700609, 0.0017387605457481053,   -0.3192884538318195,
   0.10655625844099291,    0.5069514787023305,   -0.5225710808444672]
Length: 6, dtype: float64

 The numpy array from the series is:
  [-0.34183522  0.00173876 -0.31928845  0.10655626  0.50695148 -0.52257108]

 Transpose of the series: 
 p   -0.341835
q    0.001739
r   -0.319288
n    0.106556
t    0.506951
v   -0.522571
Name: My_series, dtype: float64


In [145]:
print("\n The index corresponding to the elements of the series is:\n", s.index)
print("\n The data type of the series: ", s.dtype)
print("\n The shape of the series: ", s.shape)
print("\n Number of bytes in the underlying data: ", s.nbytes)
print("\n Number of dimensions of the underlying data: ", s.ndim)
print("\n The number of elements in the underlying data: ", s.size)
print("\n The name of the Series: ", s.name)


 The index corresponding to the elements of the series is:
 Index(['p', 'q', 'r', 'n', 't', 'v'], dtype='object')

 The data type of the series:  float64

 The shape of the series:  (6,)

 Number of bytes in the underlying data:  48

 Number of dimensions of the underlying data:  1

 The number of elements in the underlying data:  6

 The name of the Series:  My_series


### Conversion of series data type

| Function | Description | Importance |
|----------|-------------|---------------|
| `Series.astype(dtype[, copy, errors])` | Converts the data type of the Series to the specified data type. | |
| `Series.convert_dtypes([infer_objects, ...])` | Convert columns to best possible dtypes using dtypes supporting `pd.NA`. | |
| `Series.infer_objects()` | Attempt to infer better dtypes for object columns. | |
| `Series.copy([deep])` | Make a copy of this object's indices and data. (`deep = True` : Make a deep copy, including a copy of the data and the indices. With `deep=False` neither the indices nor the data are copied.) | 👍 |
| `Series.bool()` | Return the bool of a single element Series or DataFrame (deprecated since pandas 2.1). | |
| `Series.to_numpy([dtype, copy, na_value])` | A NumPy ndarray representing the values in this Series or Index. |  |
| `Series.to_period([freq, copy])` | Convert Series from DatetimeIndex to PeriodIndex. | |
| `Series.to_timestamp([freq, how, copy])` | Cast to DatetimeIndex of Timestamps, at beginning of period. | |
| `Series.to_list()` | Return a list of the values. | |
| `Series.__array__([dtype])` | Return the values as a NumPy array. | |

In [146]:
# Create a sample series
s = pd.Series([1, 2, 3, 4, 5], dtype = int)
print("Original Series:\n", s)


# Change the data type of the series to float
s = s.astype(float)

print("Series after changing data type to float:\n", s)

Original Series:
 0    1
1    2
2    3
3    4
4    5
dtype: int64
Series after changing data type to float:
 0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64


In [147]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
print("The DataFrame:\n", df)
print("\n")
print("Give information on the data type:\n", df.dtypes)
print("\n")
df.astype('int32').dtypes  # astype works on a DataFrame as well as on a Series


The DataFrame:
    col1  col2
0     1     3
1     2     4


Give information on the data type:
 col1    int64
col2    int64
dtype: object




col1    int32
col2    int32
dtype: object

In [148]:
df_series = pd.DataFrame(
    {
        "a": pd.Series([1, 2, 3],              dtype=np.dtype("int32")),
        "b": pd.Series(["x", "y", "z"],        dtype=np.dtype("O")),
        "c": pd.Series([True, False, np.nan],  dtype=np.dtype("O")),
        "d": pd.Series(["h", "i", np.nan],     dtype=np.dtype("O")),
        "e": pd.Series([10, np.nan, 20],       dtype=np.dtype("float")),
        "f": pd.Series([np.nan, 100.5, 200],   dtype=np.dtype("float")),
    }
)
print("The dataframe created from several series:\n", df_series)
print("Data type:\n", df_series.dtypes)
dfn = df_series.convert_dtypes() # convert each column to the best dtype that supports pd.NA
print("Best Dtype of the DataFrame:\n", dfn.dtypes) # check the dtypes after conversion (string columns show as `string` or `string[python]`, depending on the pandas version)

The dataframe created from several series:
    a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0
Data type:
 a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object
Best Dtype of the DataFrame:
 a      Int32
b     string
c    boolean
d     string
e      Int64
f    Float64
dtype: object


In [149]:
s = pd.Series([1, 2], index=["a", "b"])
print("The series is:\n", s)
print("\n")

s_copy = s.copy()
print("New copy of the data series:\n", s_copy)

The series is:
 a    1
b    2
dtype: int64


New copy of the data series:
 a    1
b    2
dtype: int64
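
The practical meaning of a deep copy (the default for `Series.copy()`) is that changing the copy leaves the original untouched. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

original = pd.Series([1, 2], index=["a", "b"])
duplicate = original.copy()        # deep=True is the default

duplicate["a"] = 100               # change the copy only

print(original["a"])               # the original still holds its old value
print(duplicate["a"])
```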


### Indexing, iteration

| Function | Description | Importance |
|----------|-------------|------------|
| `Series.get(key[, default])` | Get item from object for given key (ex: DataFrame column). | 👍 |
| `Series.at` | Access a single value for a row/column label pair. | 👍 |
| `Series.iat` | Access a single value for a row/column pair by integer position. |  👍 |
| `Series.loc` | Access a group of rows and columns by label(s) or a boolean array. | 👍 | 
| `Series.iloc` | Purely integer-location based indexing for selection by position. | |
| `Series.__iter__()` | Return an iterator of the values. | |
| `Series.items()` | Lazily iterate over (index, value) tuples. | 👍 |
| `Series.iteritems()` | (DEPRECATED, removed in pandas 2.0; use `items()`) Lazily iterate over (index, value) tuples. | |
| `Series.keys()` | Return alias for index. | 👍 |
| `Series.pop(item)` | Return item and drops from series. | 👍 |
| `Series.item()` | Return the first element of the underlying data as a Python scalar. | 👍 |
| `Series.xs(key[, axis, level, drop_level])` | Return cross-section from the Series/DataFrame. | |
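
The accessors in the table behave the same way on a plain Series as they do on a DataFrame. The following is a minimal sketch on a three-element Series with made-up labels and values.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.at["b"]              # single value by label
by_position = s.iat[0]            # single value by integer position
subset = s.loc[["a", "c"]]        # several values by label
first_two = s.iloc[:2]            # slice by position (end excluded)
missing = s.get("z", default=-1)  # no KeyError; returns the default instead

pairs = list(s.items())           # (index, value) tuples

popped = s.pop("c")               # removes "c" from s and returns its value

print(by_label, by_position, missing, popped)
print(list(s.index))
```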

In [150]:
# Example: 1
# DataFrame.get(key, default=None) works like Series.get: it returns the item for the given key
df = pd.DataFrame(
    [
        [24.3, 75.7, "high"],
        [31, 87.8, "high"],
        [22, 71.6, "medium"],
        [35, 95, "medium"],
    ],
    columns=["temp_celsius", "temp_fahrenheit", "windspeed"],
    index=pd.date_range(start="2014-02-12", end="2014-02-15", freq="D"),
)
print("DataFrame is:\n", df)

print("\n")
df_get = df.get(["temp_celsius", "windspeed"])
print("The two columns are:\n", df_get)

DataFrame is:
             temp_celsius  temp_fahrenheit windspeed
2014-02-12          24.3             75.7      high
2014-02-13          31.0             87.8      high
2014-02-14          22.0             71.6    medium
2014-02-15          35.0             95.0    medium


The two columns are:
             temp_celsius windspeed
2014-02-12          24.3      high
2014-02-13          31.0      high
2014-02-14          22.0    medium
2014-02-15          35.0    medium


In [151]:
# Example: 2
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=['i', 'ii', 'iii'], columns=['A', 'B', 'C'])
print("The DataFrame is:\n", df)

df.at['i', 'B']

The DataFrame is:
       A   B   C
i     0   2   3
ii    0   4   1
iii  10  20  30


2

In [152]:
print(df.loc['iii'])
print(df.loc['iii'].at['B'])

A    10
B    20
C    30
Name: iii, dtype: int64
20


In [153]:
df.iat[2, 2]

30

In [154]:
df.loc['i'].iat[1]

2

In [155]:
print(df.loc[['i', 'ii']])

    A  B  C
i   0  2  3
ii  0  4  1


In [156]:
# Example: 3
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
     index=['cobra', 'viper', 'sidewinder'],
     columns=['max_speed', 'shield'])
print("The dataframe is:\n")
print(df)

The dataframe is:

            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8


In [159]:
print("Accessing one row:")
print(df.loc['viper'])  # this returns the row as a Series.

print("\n")
print("accessing two rows:")
print(df.loc[['viper', 'sidewinder']])  # returns a DataFrame.
print("\n")
print(df.loc['cobra', 'shield'])  # single label for row and column
print("\n")
print(df.loc['cobra':'viper', 'max_speed'])  # label slice for rows (both endpoints are included) and a single column label

Accessing one row:
max_speed    4
shield       5
Name: viper, dtype: int64


accessing two rows:
            max_speed  shield
viper               4       5
sidewinder          7       8


2


cobra    1
viper    4
Name: max_speed, dtype: int64


In [160]:
df.loc[[False, False, True]]

            max_speed  shield
sidewinder          7       8


# ii. Pandas DataFrame functions

## 1. Input/output

### 1a. Pickling

| Function | Explanation |
|----------|-------------|
| `read_pickle(filepath_or_buffer[, ...])` | Load pickled pandas object (or any object) from file. |
| `DataFrame.to_pickle(path[, compression, ...])` |  Pickle (serialize) object to file. |
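
A pickle round trip can be sketched as follows. The file name `frame.pkl` is hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Jack"], "marks": [99, 98]})

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "frame.pkl")   # hypothetical file name
    df.to_pickle(path)                          # serialize the DataFrame
    restored = pd.read_pickle(path)             # load it back

print(restored.equals(df))
```

Pickle files can execute arbitrary code when loaded, so `read_pickle` should only be used on files from a trusted source.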

### 1b. Flat file

| Function | Explanation |
|----------|-------------|
| `read_table(filepath_or_buffer, *[, sep, ...])` | Read general delimited file into DataFrame.|
| `read_csv(filepath_or_buffer, *[, sep, ...])` | Read a comma-separated values (csv) file into DataFrame. |
| `DataFrame.to_csv([path_or_buf, sep, na_rep, ...])` | Write object to a comma-separated values (csv) file. |
| `read_fwf(filepath_or_buffer, *[, colspecs, ...])` | Read a table of fixed-width formatted lines into DataFrame. |
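
The sketch below exercises `to_csv`, `read_csv`, and `read_fwf` without touching the file system by reading from in-memory text through `io.StringIO`. The data is made up, and the fixed-width layout is an assumption chosen so that the default column inference of `read_fwf` can detect the two columns.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Jack"], "marks": [99, 98]})

# to_csv returns the CSV text as a string when no path is given.
csv_text = df.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))

# Fixed-width text: each column occupies a fixed range of character positions.
fwf_text = (
    "name  marks\n"
    "Tom      99\n"
    "Jack     98\n"
)
fixed = pd.read_fwf(io.StringIO(fwf_text))

print(round_trip)
print(fixed)
```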

#### 1ba. read_table

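The `read_table()` function in pandas reads general delimited text from a file path, URL, or file-like object and returns a DataFrame. Its default separator is a tab (`\t`); pass `sep` (or its alias `delimiter`) for other formats.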

It is used in following scenarios:

|Sr. | Scenario | Syntax |
|----|----------|--------|
| 1 | Reading a CSV file |   `df = pd.read_table('data.csv', sep=',')`|
| 2 | Reading a tab-separated file | `df = pd.read_table('data.tsv')` |
| 3 | Reading data from a URL | `df = pd.read_table(url, sep=',')` |
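
The effect of the separator can be checked without any file on disk by reading from in-memory text. This is a minimal sketch with invented data; `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Tab-separated text: read_table's default separator is "\t".
tsv_text = "name\tage\nTom\t20\nnick\t21\n"
from_tsv = pd.read_table(io.StringIO(tsv_text))

# Comma-separated text: pass sep="," explicitly.
csv_text = "name,age\nTom,20\nnick,21\n"
from_csv = pd.read_table(io.StringIO(csv_text), sep=",")

# Without sep=",", each comma-separated line is read as a single column.
one_column = pd.read_table(io.StringIO(csv_text))

print(from_tsv.shape)
print(from_csv.shape)
print(one_column.shape)
```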

Some examples are given below.

In [38]:
# From csv file
original_df1 = pd.read_table('nba-data.csv',delimiter=',')
original_df1

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [41]:
original_df2 = pd.read_table('nba-data.csv', delimiter=',', skiprows=4,index_col=0)

original_df2 # skiprows=4 skips the first four lines of the file (the header and three data rows), so the fifth line (R.J. Hunter) is used as the header.

R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
...,...,...,...,...,...,...,...,...
Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [43]:
# Skipping rows without setting an index column (a default integer index is created)
original_df3 = pd.read_table('nba-data.csv',delimiter=',',skiprows=4)

original_df3

Unnamed: 0,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
0,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
1,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
2,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
3,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
4,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
...,...,...,...,...,...,...,...,...,...
449,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
450,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
451,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
452,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [89]:
# For a large file, pass the number of rows you want to read to nrows.
pd.read_table('nba-data.csv',delimiter=',',index_col=0,nrows=10) # nrows is used here to reduce the memory used in the notebook 

Name,Team,Number,Position,Age,Height,Weight,College,Salary
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


In [91]:
# to skip lines from the bottom
pd.read_table('nba-data.csv',delimiter=',',index_col=0,
                     engine='python',skipfooter=5)

Name,Team,Number,Position,Age,Height,Weight,College,Salary
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...
Gordon Hayward,Utah Jazz,20.0,SF,26.0,6-8,226.0,Butler,15409570.0
Rodney Hood,Utah Jazz,5.0,SG,23.0,6-8,206.0,Duke,1348440.0
Joe Ingles,Utah Jazz,2.0,SF,28.0,6-8,226.0,,2050000.0
Chris Johnson,Utah Jazz,23.0,SF,26.0,6-6,206.0,Dayton,981348.0


In [84]:
# Row number(s) to use as the column names, and the start of the data occurs after the last row number given in header.
pd.read_table('nba-data.csv',delimiter=',', index_col=0, header=[1,3,5], nrows =4)

Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,Unnamed: 8_level_1
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Unnamed: 7_level_2,5000000.0
Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0


#### 1bb. pandas.read_csv()

The syntax is:

`pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)`

The `pd.read_csv()` function is a pandas function used to read a CSV (comma-separated values) file and convert it into a pandas DataFrame. The function takes several parameters:

- `filepath_or_buffer`: The path to the CSV file or a buffer like object.
- `sep`: The delimiter used in the CSV file. Default is ','.
- `header`: The row number(s) to use as the column names. Default is 'infer', which means pandas will try to infer the column names from the first row of the CSV file.
- `index_col`: The column to use as the row index. Default is None.
- `usecols`: The subset of columns to use in the DataFrame. Default is None, which means all columns will be used.
- `engine`: The parsing engine to use. Default is None, which means pandas will try to use the best engine for the task.
- `skiprows`: The number of lines to skip at the start of the file, or a list of line numbers (0-indexed) to skip. Default is None.
- `nrows`: The number of rows to use in the DataFrame. Default is None, which means all rows will be used.
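
Several of these parameters can be combined in one call. The sketch below reads from in-memory text instead of a file; the data and the leading comment line are invented for illustration.

```python
import io

import pandas as pd

csv_text = (
    "# exported by a hypothetical tool\n"
    "name,team,age,salary\n"
    "Ann,Red,25,100\n"
    "Bob,Blue,30,200\n"
    "Cid,Red,28,300\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,                          # skip the comment line above the header
    usecols=["name", "age", "salary"],   # keep only these columns
    index_col="name",                    # use "name" as the row index
    nrows=2,                             # read only the first two data rows
)

print(df)
```

The column named in `index_col` has to be among the columns kept by `usecols`, which is why `"name"` appears in both.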



In [81]:
pd.read_csv("nba-data.csv", nrows=4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0


In [82]:
# Without a delimiter, read_table assumes tab-separated data, so each line lands in a single column.
# read_csv defaults to a comma separator, so it still splits the columns without one.
pd.read_table('nba-data.csv', nrows=4)

Unnamed: 0,"Name,Team,Number,Position,Age,Height,Weight,College,Salary"
0,"Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,1..."
1,"Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,23..."
2,"John Holland,Boston Celtics,30.0,SG,27.0,6-5,2..."
3,"R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,18..."
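
The difference comes from the default separator: `read_table` splits on tabs, while `read_csv` splits on commas. A minimal sketch with a hypothetical in-memory CSV (not the NBA file) makes the comparison explicit:

```python
import io

import pandas as pd

# Hypothetical comma-separated text, used only for illustration.
raw = "Name,Team,Age\nAnn,Reds,25\nBob,Blues,31\n"

# read_table splits on tabs by default, so every comma-separated
# line ends up as one string in a single column.
as_table = pd.read_table(io.StringIO(raw))

# read_csv splits on commas by default.
as_csv = pd.read_csv(io.StringIO(raw))

# Passing sep="," makes read_table parse the text like read_csv.
as_table_comma = pd.read_table(io.StringIO(raw), sep=",")

print(as_table.shape, as_csv.shape, as_table_comma.shape)
```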


In [83]:
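# Load the CSV with the Python parsing engine.
# header=1 takes the column names from the second line of the file, so the original header row is dropped.
df = pd.read_csv('nba-data.csv', sep=',', engine='python', header=1, nrows=4)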
df

Unnamed: 0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
0,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
1,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
2,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
3,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0


### 1c. Clipboard

| Function | Explanation |
|----------|-------------|
| `read_clipboard([sep])` | Read text from clipboard and pass to read_csv. |
| `DataFrame.to_clipboard([excel, sep])` | Copy object to the system clipboard. |

### 1d. Excel

| Function | Explanation |
|----------|-------------|
| `read_excel(io[, sheet_name, header, names, ...])` | Read an Excel file into a pandas DataFrame. | 
| `DataFrame.to_excel(excel_writer[, ...])` | Write object to an Excel sheet. |
| `ExcelFile.parse([sheet_name, header, names, ...])` | Parse specified sheet(s) into a DataFrame. |
| `Styler.to_excel(excel_writer[, sheet_name, ...])` | Write Styler to an Excel sheet. |
| `ExcelWriter(path[, engine, date_format, ...])` | Class for writing DataFrame objects into excel sheets. |

### 1e. JSON

| Function | Explanation |
|----------|-------------|
| `read_json(path_or_buf, *[, orient, typ, ...])` | Convert a JSON string to pandas object. | 
| `json_normalize(data[, record_path, meta, ...])` | Normalize semi-structured JSON data into a flat table. | 
| `DataFrame.to_json([path_or_buf, orient, ...])` | Convert the object to a JSON string. | 
| `build_table_schema(data[, index, ...])` | Create a Table schema from data. |
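
A short round trip shows how three of these functions fit together. This is a minimal sketch with hypothetical data. The JSON text is wrapped in `io.StringIO` so that `read_json` receives a file-like object.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [25, 31]})

# Serialize to a JSON string with one object per row.
text = df.to_json(orient="records")

# Parse the string back into a DataFrame.
restored = pd.read_json(io.StringIO(text), orient="records")

# Flatten nested records: nested keys become dotted column names.
nested = [
    {"name": "Ann", "team": {"city": "Boston", "wins": 48}},
    {"name": "Bob", "team": {"city": "Utah", "wins": 40}},
]
flat = pd.json_normalize(nested)

print(text)
print(flat.columns.tolist())
```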

### 1f. HTML

| Function | Explanation |
|----------|-------------|
| `read_html(io, *[, match, flavor, header, ...])` | Read HTML tables into a list of DataFrame objects. |
| `DataFrame.to_html([buf, columns, col_space, ...])` | Render a DataFrame as an HTML table. | 
| `Styler.to_html([buf, table_uuid, ...])` | Write Styler to a file, buffer or string in HTML-CSS format. |
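
As a minimal sketch, `DataFrame.to_html` can be tried without any extra packages. The data below is hypothetical. `read_html` is left out because it depends on an optional HTML parser such as `lxml`, which may not be installed.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [25, 31]})

# Render the DataFrame as an HTML <table> string, without the row index.
html = df.to_html(index=False)

print(html)
```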

### 1g. XML

| Function | Explanation |
|----------|-------------|
| `read_xml(path_or_buffer, *[, xpath, ...])` | Read XML document into a DataFrame object. | 
| `DataFrame.to_xml([path_or_buffer, index, ...])` | Render a DataFrame to an XML document. |

### 1h. LaTeX

| Function | Explanation |
|----------|-------------|
| `DataFrame.to_latex([buf, columns, ...])` | Render object to a LaTeX tabular, longtable, or nested table. |
| `Styler.to_latex([buf, column_format, ...])` | Write Styler to a file, buffer or string in LaTeX format. |

### 1i. HDFStore: PyTables (HDF5)

| Function | Explanation |
|----------|-------------|
| `read_hdf(path_or_buf[, key, mode, errors, ...])` | Read from the store, close it if we opened it. | 
| `HDFStore.put(key, value[, format, index, ...])` | Store object in HDFStore. | 
| `HDFStore.append(key, value[, format, axes, ...])` | Append to Table in file. | 
| `HDFStore.get(key)` | Retrieve pandas object stored in file. | 
| `HDFStore.select(key[, where, start, stop, ...])` | Retrieve pandas object stored in file, optionally based on where criteria. | 
| `HDFStore.info()` | Print detailed information on the store. |
| `HDFStore.keys([include])` | Return a list of keys corresponding to objects stored in HDFStore. | 
| `HDFStore.groups()` | Return a list of all the top-level nodes. |
| `HDFStore.walk([where])` | Walk the pytables group hierarchy for pandas objects. |

### 1j. SQL

| Function | Explanation |
|----------|-------------|
| `read_sql_table(table_name, con[, schema, ...])` | Read SQL database table into a DataFrame. |
| `read_sql_query(sql, con[, index_col, ...])` |  Read SQL query into a DataFrame. |
| `read_sql(sql, con[, index_col, ...])` | Read SQL query or database table into a DataFrame. |
| `DataFrame.to_sql(name, con[, schema, ...])` | Write records stored in a DataFrame to a SQL database. |
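
The sketch below writes a DataFrame to a database and queries it back, assuming an in-memory SQLite database from the standard-library `sqlite3` module. The table name, column names, and data are hypothetical. `read_sql_table` is not shown because it requires SQLAlchemy rather than a plain `sqlite3` connection.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "team": ["Reds", "Blues", "Reds"],
        "salary": [100.0, 250.5, 90.0],
    }
)

con = sqlite3.connect(":memory:")
try:
    # Write the DataFrame to a new table called "players".
    df.to_sql("players", con, index=False)

    # Read back a filtered subset; "?" is sqlite3's placeholder style.
    reds = pd.read_sql_query(
        "SELECT name, salary FROM players WHERE team = ? ORDER BY name",
        con,
        params=("Reds",),
    )
finally:
    con.close()

print(reds)
```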

## 2. Data Manipulation functions

The top-level pandas functions below reshape, combine, bin, and encode data.


| Function | Explanation |
|----------|-------------|
| `melt(frame[, id_vars, value_vars, var_name, ...])`  |  Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.  |
| `pivot(data, *[, index, columns, values])` | Return reshaped DataFrame organized by given index / column values. | 
| `pivot_table(data[, values, index, columns, ...])` | Create a spreadsheet-style pivot table as a DataFrame. |
| `crosstab(index, columns[, values, rownames, ...])` | Compute a simple cross tabulation of two (or more) factors. |
| `cut(x, bins[, right, labels, retbins, ...])` | Bin values into discrete intervals. | 
| `qcut(x, q[, labels, retbins, precision, ...])` | Quantile-based discretization function. | 
| `merge(left, right[, how, on, left_on, ...])` | Merge DataFrame or named Series objects with a database-style join. |
| `merge_ordered(left, right[, on, left_on, ...])` | Perform a merge for ordered data with optional filling/interpolation. | 
| `merge_asof(left, right[, on, left_on, ...])` | Perform a merge by key distance. | 
| `concat(objs, *[, axis, join, ignore_index, ...])` | Concatenate pandas objects along a particular axis. |
| `get_dummies(data[, prefix, prefix_sep, ...])` | Convert categorical variable into dummy/indicator variables. | 
| `from_dummies(data[, sep, default_category])` | Create a categorical DataFrame from a DataFrame of dummy variables. | 
| `factorize(values[, sort, use_na_sentinel, ...])` | Encode the object as an enumerated type or categorical variable. | 
| `unique(values)` | Return unique values based on a hash table. | 
| `wide_to_long(df, stubnames, i, j[, sep, suffix])` | Unpivot a DataFrame from wide to long format. |
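
A few of the functions in the table can be chained on a small example. This is a minimal sketch with hypothetical data. It reshapes a wide table to long format and back, joins a lookup table, stacks frames, bins a numeric column, and one-hot encodes a categorical column.

```python
import pandas as pd

wide = pd.DataFrame({"name": ["Ann", "Bob"], "2021": [10, 20], "2022": [15, 25]})

# melt: wide -> long, one row per (name, year) pair.
long = pd.melt(wide, id_vars="name", var_name="year", value_name="points")

# pivot: long -> wide again, with names as the index and years as columns.
back = pd.pivot(long, index="name", columns="year", values="points")

# merge: database-style left join against a lookup table.
teams = pd.DataFrame({"name": ["Ann", "Bob"], "team": ["Reds", "Blues"]})
merged = pd.merge(long, teams, on="name", how="left")

# concat: stack two frames vertically and renumber the rows.
stacked = pd.concat([wide, wide], ignore_index=True)

# cut: bin points into the intervals (0, 12] and (12, 30].
merged["level"] = pd.cut(merged["points"], bins=[0, 12, 30], labels=["low", "high"])

# get_dummies: one indicator column per team.
dummies = pd.get_dummies(merged["team"])

print(merged)
print(back)
```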

# References

1. https://pandas.pydata.org/docs/user_guide/index.html#user-guide
2. https://pandas.pydata.org/docs/
3. https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
4. https://www.w3resource.com/pandas/index.php