# Basics of Pandas

Pandas is an open-source Python library for data manipulation and analysis. It is widely used in data science and machine learning because its labeled data structures and built-in functions make cleaning, transforming, and summarizing data straightforward. Here are the basics you need to know about pandas:

### Key Features
1. **Data Structures**: Pandas provides two primary data structures:
   - **Series**: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, etc.).
   - **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table.

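2. **Data Alignment**: Operations between Series and DataFrames align automatically on index and column labels. A label that is missing from one operand produces `NaN` rather than an error, which makes missing data easy to spot and manage.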

3. **Data Cleaning and Preparation**: Functions for handling missing data, removing duplicates, transforming data, and more.

4. **Data Aggregation and Grouping**: Powerful group-by functionality to perform split-apply-combine operations on data sets.

5. **Time Series Support**: Functions for working with time series data, including date range generation and frequency conversion.

6. **Integration with Other Libraries**: Easily integrates with other Python libraries like NumPy, SciPy, Matplotlib, and scikit-learn.
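
Label alignment, listed among the features above, can be illustrated with a minimal sketch. The Series names and values here are made up for the example:

```python
import pandas as pd

# Two Series whose labels only partly overlap
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels; a label present in only one operand gives NaN
total = s1 + s2
print(total)

# fill_value treats a missing label as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)
print(filled)
```

Only `b` and `c` exist in both Series, so only those two labels get a numeric result in `total`.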

### Getting Started with Pandas

#### Installation
To install pandas, you can use pip:

```bash
pip install pandas
```

#### Importing Pandas
To use pandas, you need to import it in your Python script:

In [1]:
import pandas as pd

### Creating Data Structures

#### Creating a Series
You can create a Series by passing a list, dictionary, or scalar value:

In [2]:
import pandas as pd
import numpy as np

# From a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# From a dictionary
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
a    1
b    2
c    3
dtype: int64
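
The cell above covers the list and dictionary cases. For the scalar case, a small sketch (the index labels are arbitrary) might look like this:

```python
import pandas as pd

# A scalar is repeated for every label in the index
s = pd.Series(5.0, index=['a', 'b', 'c'])
print(s)
```

When a scalar is passed, the index determines the length of the resulting Series.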


#### Creating a DataFrame
You can create a DataFrame by passing a dictionary of lists, a list of dictionaries, or a NumPy array:

In [3]:
import pandas as pd
import numpy as np

# From a dictionary of lists
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
df = pd.DataFrame(data)
print(df)

# From a list of dictionaries
data = [{'A': 1, 'B': 2}, {'A': 5, 'B': 10}]
df = pd.DataFrame(data)
print(df)

# From a NumPy array
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)

   A  B
0  1  5
1  2  6
2  3  7
3  4  8
   A   B
0  1   2
1  5  10
   A  B  C
0  1  2  3
1  4  5  6


### DataFrame Operations

#### Viewing Data

In [4]:
# Display the first few rows
print(df.head())

# Display the last few rows
print(df.tail())

# Display the DataFrame's index, columns, and values
print(df.index)
print(df.columns)
print(df.values)

# Basic statistical details
print(df.describe())

# Transpose the DataFrame
print(df.T)

# Sort the columns by label in descending order (axis=1 sorts column labels)
print(df.sort_index(axis=1, ascending=False))

# Sort by values
print(df.sort_values(by='B'))

   A  B  C
0  1  2  3
1  4  5  6
   A  B  C
0  1  2  3
1  4  5  6
RangeIndex(start=0, stop=2, step=1)
Index(['A', 'B', 'C'], dtype='object')
[[1 2 3]
 [4 5 6]]
             A        B        C
count  2.00000  2.00000  2.00000
mean   2.50000  3.50000  4.50000
std    2.12132  2.12132  2.12132
min    1.00000  2.00000  3.00000
25%    1.75000  2.75000  3.75000
50%    2.50000  3.50000  4.50000
75%    3.25000  4.25000  5.25000
max    4.00000  5.00000  6.00000
   0  1
A  1  4
B  2  5
C  3  6
   C  B  A
0  3  2  1
1  6  5  4
   A  B  C
0  1  2  3
1  4  5  6


#### Selection

In [5]:
# Selecting a single column
print(df['A'])

# Slicing rows by position (the end of the slice is excluded)
print(df[0:3])

# Selecting by label
print(df.loc[0])

# Selecting by position
print(df.iloc[0])

# Boolean indexing
print(df[df['A'] > 2])

0    1
1    4
Name: A, dtype: int32
   A  B  C
0  1  2  3
1  4  5  6
A    1
B    2
C    3
Name: 0, dtype: int32
A    1
B    2
C    3
Name: 0, dtype: int32
   A  B  C
1  4  5  6
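
Because the DataFrame above has a default integer index, `.loc` and `.iloc` look identical there. The following sketch uses a hypothetical DataFrame with string labels to show the difference, and also shows how to combine boolean conditions:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]},
                  index=['x', 'y', 'z'])

# .loc selects by label; a label slice includes both endpoints
by_label = df.loc['x':'y', ['A', 'C']]

# .iloc selects by integer position; the slice end is excluded
by_position = df.iloc[0:2, [0, 2]]

# Combine conditions with & or |, wrapping each one in parentheses
mask = (df['A'] > 1) & (df['C'] < 9)
filtered = df[mask]
print(by_label, by_position, filtered, sep='\n')
```

Both `by_label` and `by_position` pick rows `x` and `y` with columns `A` and `C`; they just express the selection differently.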


### Handling Missing Data

In [6]:
# Detect missing values
print(df.isna())

# Drop rows with missing values (returns a new DataFrame; df itself is unchanged)
df.dropna()

# Fill missing values with 5 (also returns a new DataFrame)
df.fillna(value=5)

       A      B      C
0  False  False  False
1  False  False  False


   A  B  C
0  1  2  3
1  4  5  6
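
The DataFrame used above contains no missing values, so the calls have nothing to change. A minimal sketch with a hypothetical DataFrame that does contain `NaN` shows the effect more clearly:

```python
import pandas as pd
import numpy as np

df_missing = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, np.nan]})

# Count missing values per column
print(df_missing.isna().sum())

# dropna and fillna return new DataFrames; df_missing is unchanged
dropped = df_missing.dropna()
filled = df_missing.fillna(value=0)
print(dropped)
print(filled)
```

To keep the result, assign it back (for example `df_missing = df_missing.dropna()`).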


### Grouping

In [7]:
# Group by a column and aggregate
grouped = df.groupby('A').sum()
print(grouped)

   B  C
A      
1  2  3
4  5  6
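
In the example above every value of `A` is unique, so each group holds a single row. A sketch with repeated keys (the `sales` data is invented for illustration) shows the split-apply-combine behavior:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'north'],
    'units': [10, 20, 30, 40, 50],
})

# Split rows by region, then sum the units within each group
totals = sales.groupby('region')['units'].sum()
print(totals)

# agg computes several statistics per group in one call
summary = sales.groupby('region')['units'].agg(['sum', 'mean', 'count'])
print(summary)
```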


### Merging and Joining

In [8]:
# Concatenating DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})
result = pd.concat([df1, df2])
print(result)

# Merging DataFrames
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'B': ['B0', 'B1', 'B2']})
result = pd.merge(left, right, on='key')
print(result)

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
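
Two details are worth a short sketch: the concatenated result above repeats the index values 0 to 2, and the merge only shows the case where every key matches. The frames below are hypothetical variations on the ones above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']})
df2 = pd.DataFrame({'A': ['A2', 'A3']})

# ignore_index=True renumbers the rows instead of repeating 0, 1
stacked = pd.concat([df1, df2], ignore_index=True)

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'B': ['B1', 'B2', 'B3']})

# The default inner join keeps only keys found in both frames
inner = pd.merge(left, right, on='key')

# An outer join keeps every key and fills the gaps with NaN
outer = pd.merge(left, right, on='key', how='outer')
print(stacked, inner, outer, sep='\n')
```

The `how` parameter also accepts `'left'` and `'right'` to keep all keys from one side only.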


### Input/Output

In [None]:
# Reading from a CSV file
df = pd.read_csv('data.csv')

# Writing to a CSV file
df.to_csv('output.csv')

# Reading from an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Writing to an Excel file
df.to_excel('output.xlsx', sheet_name='Sheet1')
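
By default `to_csv` also writes the index as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below avoids that with `index=False` and uses an in-memory buffer in place of a real file, so it runs without touching disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# to_csv returns the CSV text as a string when no path is given
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts any file-like object, not only a path
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```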

These are the fundamental concepts and operations in pandas. Mastering these basics will allow you to perform a wide range of data manipulation and analysis tasks.

The following sections go a bit deeper into additional pandas functionality that is frequently used in data analysis.

### Advanced DataFrame Operations

#### Column Operations
You can add, modify, and delete columns in a DataFrame easily.
```python
# Add a new column
df['C'] = df['A'] + df['B']

# Modify an existing column
df['A'] = df['A'] * 2

# Delete a column
df = df.drop('C', axis=1)
```

#### Renaming
You can rename the columns or the index of a DataFrame.
```python
# Rename columns
df = df.rename(columns={'A': 'Alpha', 'B': 'Beta'})

# Rename index
df = df.rename(index={0: 'first', 1: 'second'})
```

#### Setting and Resetting Index
You can set a column as the index of a DataFrame and later reset it to the default integer index.
```python
# Set a column as index
df = df.set_index('Alpha')

# Reset index
df = df.reset_index()
```

### DataFrame Methods

#### Apply Method
The `apply` method allows you to apply a function along an axis of the DataFrame.
```python
# Applying a function to each column
df.apply(np.sqrt)

# Applying a function to each row
df.apply(lambda x: x.max() - x.min(), axis=1)
```

#### Applymap Method
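The `applymap` method applies a function to each element of the DataFrame. It was deprecated in pandas 2.1 in favor of `DataFrame.map`, which has the same element-wise behavior.
```python
# Applying a function to each element (on pandas 2.1 or later, prefer df.map)
df.applymap(lambda x: x * 2)
```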

#### Map Method
The `map` method is used for element-wise operations on Series.
```python
# Mapping a function to a Series
df['Alpha'] = df['Alpha'].map(lambda x: x * 2)
```
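
To make the axis behavior of `apply` and the element-wise behavior of `Series.map` concrete, here is a small self-contained sketch. The column names follow the renamed columns above, and the values are made up:

```python
import pandas as pd

df = pd.DataFrame({'Alpha': [1, 4, 9], 'Beta': [16, 25, 36]})

# axis=0 (the default) passes each column to the function
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1 passes each row to the function
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

# Series.map also accepts a dict; values without a matching key become NaN
labels = df['Alpha'].map({1: 'one', 4: 'four'})
print(col_range, row_range, labels, sep='\n')
```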

### Handling Duplicates
Pandas provides methods to handle duplicate data in your DataFrame.
```python
# Detecting duplicates
df.duplicated()

# Removing duplicates
df = df.drop_duplicates()
```
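
Both methods accept `subset` and `keep` arguments to control what counts as a duplicate and which occurrence survives. A minimal sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann', 'bob', 'ann', 'ann'],
                   'score': [1, 2, 1, 3]})

# duplicated flags rows that repeat an earlier row in every column
flags = df.duplicated()

# subset limits the comparison to chosen columns;
# keep='last' retains the final occurrence of each name
unique_names = df.drop_duplicates(subset='name', keep='last')
print(flags.tolist())
print(unique_names)
```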

### Combining DataFrames
Besides `concat` and `merge`, there are other ways to combine DataFrames.

#### Join Method
The `join` method is useful for combining DataFrames with a shared index.
```python
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
result = df1.join(df2)
print(result)
```

### Pivot Tables
Pivot tables are useful for summarizing data.
```python
# Creating a pivot table (the string 'mean' avoids the FutureWarning recent pandas emits for np.mean)
pivot_df = df.pivot_table(values='B', index='A', columns='C', aggfunc='mean')
print(pivot_df)
```
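
A pivot table is easier to read with data that has repeated categories. The sketch below uses a hypothetical `sales` DataFrame rather than the `df` from earlier sections:

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south', 'north'],
    'product': ['apple', 'pear', 'apple', 'pear', 'apple'],
    'units': [10, 20, 30, 40, 50],
})

# Rows come from 'region', columns from 'product', cells hold mean units
pivot = sales.pivot_table(values='units', index='region',
                          columns='product', aggfunc='mean')
print(pivot)
```

The two `north`/`apple` rows (10 and 50) are averaged into a single cell.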

### Working with Time Series Data
Pandas has strong support for working with time series data.

#### Date Range Generation
```python
# Generating a date range
dates = pd.date_range('20230101', periods=6)
print(dates)
```

#### Converting to Datetime
```python
# Converting a column to datetime
df['date'] = pd.to_datetime(df['date'])
```

#### Resampling
Resampling is used to convert a time series from one frequency to another.
```python
# Resampling to monthly frequency (pandas 2.2 and later deprecate 'M' in favor of 'ME' for month end)
df.set_index('date').resample('M').sum()
```
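
The conversion and resampling steps can be combined into one runnable sketch. The data is invented, and a two-day frequency (`'2D'`) is used here only to keep the example small:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02', '2023-01-03',
             '2023-01-04', '2023-01-05', '2023-01-06'],
    'value': [1, 2, 3, 4, 5, 6],
})

# Parse the strings so the column has a datetime dtype
df['date'] = pd.to_datetime(df['date'])

# Group the daily rows into two-day bins and sum each bin
two_day = df.set_index('date').resample('2D').sum()
print(two_day)
```

`resample` needs a datetime-like index, which is why the `date` column is set as the index first.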

### Data Input/Output (continued)
In addition to CSV and Excel, pandas can read and write to a variety of formats.

#### Reading and Writing JSON
```python
# Reading from a JSON file
df = pd.read_json('data.json')

# Writing to a JSON file
df.to_json('output.json')
```

#### Reading and Writing SQL
You can read from and write to SQL databases using pandas.
```python
from sqlalchemy import create_engine

# Create an engine
engine = create_engine('sqlite:///:memory:')

# Write to SQL
df.to_sql('table_name', engine)

# Read from SQL
df = pd.read_sql('table_name', engine)
```
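
If SQLAlchemy is not installed, pandas can also work with a `sqlite3` connection from the standard library. The following is a minimal sketch under that assumption; the table name `items` and its columns are hypothetical:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})

# An in-memory SQLite database from the standard library
conn = sqlite3.connect(':memory:')

# index=False keeps the DataFrame index out of the table
df.to_sql('items', conn, index=False)

# Run a query and load the result into a DataFrame
result = pd.read_sql('SELECT id, name FROM items WHERE id > 1', conn)
print(result)
conn.close()
```

With a plain `sqlite3` connection, `read_sql` expects a SQL query rather than a bare table name.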

### Visualization
Pandas integrates with Matplotlib for data visualization.
```python
import matplotlib.pyplot as plt

# Simple line plot
df.plot()
plt.show()

# Bar plot
df.plot(kind='bar')
plt.show()
```

### Best Practices
1. **Use Vectorized Operations**: Avoid looping through rows and columns. Instead, use vectorized operations for better performance.
2. **Understand Data Types**: Know the data types of your columns and convert them if necessary for better performance and memory usage.
3. **Chain Methods**: Pandas methods can often be chained together for cleaner and more readable code.
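
The three practices above can be sketched together on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0],
                   'qty': [1, 2, 3],
                   'city': ['Oslo', 'Rome', 'Oslo']})

# Vectorized: one expression over whole columns, no Python-level loop
df['total'] = df['price'] * df['qty']

# Data types: repeated strings often take less memory as a categorical
df['city'] = df['city'].astype('category')

# Chained: filter, sort, and renumber in one readable pipeline
top = (
    df[df['total'] > 10]
    .sort_values('total', ascending=False)
    .reset_index(drop=True)
)
print(top)
```

Wrapping the chain in parentheses lets each method sit on its own line without backslashes.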

By mastering these additional basics, you'll be well-equipped to handle a wide range of data manipulation and analysis tasks using pandas.

## Excel File

To read an Excel file in pandas, you use the `read_excel` function. This function allows you to import data from an Excel file into a pandas DataFrame.

### Syntax
```python
# Abbreviated signature; the exact parameter list varies between pandas versions.
# `convert_float` and `mangle_dupe_cols`, listed in older documentation, were removed in pandas 2.0.
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, parse_dates=False, thousands=None, comment=None, skipfooter=0, storage_options=None)
```

### Parameters
Some of the commonly used parameters include:
- **io**: The file path or object, or URL of the Excel file to be read.
- **sheet_name**: The sheet to read. This can be the sheet name (string), sheet index (integer), a list of sheet names/indices, or `None` to read all sheets.
- **header**: Row number(s) to use as the column names. Defaults to the first row (0).
- **names**: List of column names to use.
- **index_col**: Column(s) to set as index (row labels).
- **usecols**: Return a subset of the columns.
- **dtype**: Data type for data or columns.

### Example

In [2]:
import pandas as pd

# Read an Excel file
df = pd.read_excel('LUSID Excel - Setting up your market data.xlsx', sheet_name='Datetime format')

# Display the DataFrame
print(df)

    Unnamed: 0  Unnamed: 1  Unnamed: 2  \
0          NaN         NaN         NaN   
1          NaN         NaN         NaN   
2          NaN         NaN         NaN   
3          NaN         NaN         NaN   
4          NaN         NaN         NaN   
5          NaN         NaN         NaN   
6          NaN         NaN         NaN   
7          NaN         NaN         NaN   
8          NaN         NaN         NaN   
9          NaN         NaN         NaN   
10         NaN         NaN         NaN   
11         NaN         NaN         NaN   
12         NaN         NaN         NaN   
13         NaN         NaN         NaN   
14         NaN         NaN         NaN   
15         NaN         NaN         NaN   
16         NaN         NaN         NaN   
17         NaN         NaN         NaN   
18         NaN         NaN         NaN   
19         NaN         NaN         NaN   
20         NaN         NaN         NaN   
21         NaN         NaN         NaN   
22         NaN         NaN        

In this example:
- The `read_excel` function reads the Excel file `'LUSID Excel - Setting up your market data.xlsx'`.
- The `sheet_name` parameter specifies which sheet to read, in this case `'Datetime format'`.
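
Columns named `Unnamed: N` that are mostly `NaN`, as in the output above, usually appear when the cells pandas reads as the header row are empty, for instance when a sheet has blank leading rows or columns. The `skiprows` and `usecols` parameters can address this at read time. Alternatively, the empty rows and columns can be dropped afterwards, as in this sketch, which uses a hypothetical stand-in DataFrame instead of the real workbook:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for a sheet read with blank leading rows and columns
raw = pd.DataFrame({
    'Unnamed: 0': [np.nan, np.nan, np.nan],
    'Unnamed: 1': [np.nan, 'id', 'X1'],
    'Unnamed: 2': [np.nan, 'price', 10.5],
})

# Drop rows, then columns, that contain no data at all
trimmed = raw.dropna(how='all').dropna(axis=1, how='all')
print(trimmed)
```

`how='all'` removes a row or column only when every value in it is missing, so partially filled rows are kept.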

### Reading Multiple Sheets

If you want to read multiple sheets from an Excel file into a dictionary of DataFrames:

In [3]:
import pandas as pd

# Read multiple sheets
dfs = pd.read_excel('LUSID Excel - Setting up your market data.xlsx', sheet_name=['Datetime format', 'List identifiers'])

# Display the DataFrame for the 'Datetime format' sheet
print(dfs['Datetime format'])

# Display the DataFrame for the 'List identifiers' sheet
print(dfs['List identifiers'])

    Unnamed: 0  Unnamed: 1  Unnamed: 2  \
0          NaN         NaN         NaN   
1          NaN         NaN         NaN   
2          NaN         NaN         NaN   
3          NaN         NaN         NaN   
4          NaN         NaN         NaN   
5          NaN         NaN         NaN   
6          NaN         NaN         NaN   
7          NaN         NaN         NaN   
8          NaN         NaN         NaN   
9          NaN         NaN         NaN   
10         NaN         NaN         NaN   
11         NaN         NaN         NaN   
12         NaN         NaN         NaN   
13         NaN         NaN         NaN   
14         NaN         NaN         NaN   
15         NaN         NaN         NaN   
16         NaN         NaN         NaN   
17         NaN         NaN         NaN   
18         NaN         NaN         NaN   
19         NaN         NaN         NaN   
20         NaN         NaN         NaN   
21         NaN         NaN         NaN   
22         NaN         NaN        

### Reading All Sheets

To read all sheets into a dictionary of DataFrames:

In [5]:
import pandas as pd

# Read all sheets
dfs = pd.read_excel('LUSID Excel - Setting up your market data.xlsx', sheet_name=None)

# Display the keys (sheet names)
print(dfs.keys())

# Display the DataFrame for a specific sheet
print(dfs['Edit instrument'])

dict_keys(['Datetime format', 'List identifiers', 'List instruments', 'Get instrument definition', 'Edit instrument', 'Add instrument', 'Add property definition', 'List prices 1', 'List prices 2', 'Update prices', 'Create prices', 'Inputs'])
    Unnamed: 0  Unnamed: 1                       Unnamed: 2  \
0          NaN         NaN                              NaN   
1          NaN         NaN                              NaN   
2          NaN         NaN                              NaN   
3          NaN         NaN                              NaN   
4          NaN         NaN                              NaN   
5          NaN         NaN                              NaN   
6          NaN         NaN                              NaN   
7          NaN         NaN                              NaN   
8          NaN         NaN                              NaN   
9          NaN         NaN                              NaN   
10         NaN         NaN                              NaN   
11
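
Once all sheets are loaded, the dictionary can be looped over or stacked into a single DataFrame. The sketch below uses a hand-built dictionary with hypothetical sheet names in place of the real workbook:

```python
import pandas as pd

# Hypothetical stand-in for the dict returned by sheet_name=None
dfs = {
    'Sheet A': pd.DataFrame({'x': [1, 2]}),
    'Sheet B': pd.DataFrame({'x': [3, 4, 5]}),
}

# Report the shape of every sheet
for name, sheet in dfs.items():
    print(name, sheet.shape)

# Stack all sheets; the dict keys become the outer index level
combined = pd.concat(dfs, names=['sheet', 'row'])
print(combined)
```

Stacking like this is only meaningful when the sheets share the same columns.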

These examples demonstrate how to use the `read_excel` function to read data from Excel files into pandas DataFrames.