
# Pandas Interview Questions (Data Scientist, Machine Learning, Data Analyst)

Pandas is a powerful Python library for data manipulation and analysis.
- It provides two primary data structures:
    * Series
    * DataFrame
- Pandas is widely used in:
    * data preprocessing
    * exploration
    * transformation
    * analysis tasks

## Importing Pandas

Pandas must be imported before use. By convention it is imported under the alias `pd`.

In [None]:
import pandas as pd

### Creating a DataFrame

- A DataFrame is a fundamental data structure in Pandas, which can be thought of as a two-dimensional table with rows and columns.
- It's a common way to represent and work with structured data.

**From a Dictionary**

In [None]:
data = {'Name': ['Mukesh', 'Papon', 'Charls', 'Zeny', 'Cube'],
        'Age': [25, 30, 28, 23, 32]}
df = pd.DataFrame(data)
df.head()
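A DataFrame can also be built from a list of row dictionaries, and each of its columns is a Series, the other primary structure mentioned in the introduction. The following is a minimal sketch; the names `records`, `df_records`, and `ages` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Illustrative data: each dictionary describes one row
records = [
    {'Name': 'Mukesh', 'Age': 25},
    {'Name': 'Papon', 'Age': 30},
]
df_records = pd.DataFrame(records)

# A standalone Series: one-dimensional labelled data
ages = pd.Series([25, 30], name='Age')

# Selecting a single column from a DataFrame returns a Series
print(type(df_records['Age']))
print(df_records['Age'].equals(ages))
```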

## Exploring Data

- Pandas provides various methods to help you understand your data.

**Displaying the First 5 Rows**

In [None]:
df.head()

#### Summary Statistics

- This provides summary statistics for the numeric columns: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.

In [None]:
df.describe()
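By default `describe()` skips non-numeric columns such as `Name`. As a small sketch (the name `df_demo` is a stand-in for the DataFrame above), passing `include='all'` adds rows such as `unique`, `top`, and `freq` for them:

```python
import pandas as pd

# Stand-in for the DataFrame created earlier
df_demo = pd.DataFrame({'Name': ['Mukesh', 'Papon', 'Charls', 'Zeny', 'Cube'],
                        'Age': [25, 30, 28, 23, 32]})

# Default behaviour: numeric columns only
numeric_summary = df_demo.describe()

# include='all' also summarizes non-numeric columns
full_summary = df_demo.describe(include='all')
print(full_summary)
```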

#### Data Types of Columns

- This displays the data type of each column in the DataFrame.

In [None]:
df.dtypes

#### Column Names

- This will display the column names of the DataFrame.

In [None]:
print(df.columns)

## Indexing and Selection

Indexing and selecting specific data from a DataFrame are fundamental operations for data analysis and manipulation.

#### Accessing a Column

This retrieves the 'Name' column from the DataFrame as a Series.

In [None]:
df['Name']

#### Accessing Rows using `iloc`

This retrieves the first row of the DataFrame. `iloc` selects by integer position, starting at 0.

In [None]:
df.iloc[0]
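Interview questions often contrast `iloc` (position-based) with `loc` (label-based). The sketch below assumes a hypothetical DataFrame `df_demo` with string row labels so that the two are easy to tell apart:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Mukesh', 'Papon', 'Charls'],
                        'Age': [25, 30, 28]},
                       index=['a', 'b', 'c'])  # hypothetical row labels

first_by_position = df_demo.iloc[0]   # row at position 0
first_by_label = df_demo.loc['a']     # row labelled 'a'

# iloc slices exclude the end position; loc slices include the end label
by_position = df_demo.iloc[0:2]            # positions 0 and 1
by_label = df_demo.loc['a':'b', ['Name']]  # labels 'a' through 'b', Name column only
```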

#### Boolean Indexing

This returns the rows where the 'Age' column value is greater than 25.

In [None]:
df[df['Age'] > 25]
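Several conditions can be combined into one mask. The following sketch uses a stand-in DataFrame named `df_demo`; note that each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparisons.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Mukesh', 'Papon', 'Charls', 'Zeny', 'Cube'],
                        'Age': [25, 30, 28, 23, 32]})

# & means "and", | means "or"; wrap each condition in parentheses
between = df_demo[(df_demo['Age'] > 25) & (df_demo['Age'] < 32)]

# isin() tests membership in a list of values
selected = df_demo[df_demo['Name'].isin(['Zeny', 'Cube'])]

# ~ negates a boolean mask
not_selected = df_demo[~df_demo['Name'].isin(['Zeny', 'Cube'])]
```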

## Data Cleaning

Data cleaning is a critical step before analysis.
- Pandas provides methods to handle missing values, duplicates, and data type conversions.

In [None]:
import pandas as pd
import numpy as np

# Creating a dictionary for the dataset
data = {
    'Name': ['Mukesh', 'Papon', 'Yono', 'Davendra', 'Cube', np.nan],
    'Age': [25, 30, 28, 22, 27, 23],
    'Gender': ['M', 'F', 'M', 'M', 'F','M'],
    'City': ['Gurugram', 'Pune', 'Kochi', 'Houston', np.nan, np.nan]
}

# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
df

In [None]:
# Count the missing values in each column
df.isnull().sum()

#### Handling Missing Values

In [None]:
# Drop rows with missing values
df1 = df.copy(deep=True)
df1 = df1.dropna()

df1.isnull().sum()
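Dropping every row that has any missing value can discard a lot of data. As a hedged sketch on a small stand-in DataFrame (`df_demo` is illustrative), `dropna()` accepts `subset`, `how`, and `thresh` to narrow what gets removed:

```python
import pandas as pd
import numpy as np

df_demo = pd.DataFrame({
    'Name': ['Mukesh', 'Papon', 'Cube', np.nan],
    'City': ['Gurugram', 'Pune', np.nan, np.nan],
})

# Drop a row only when 'Name' is missing
named = df_demo.dropna(subset=['Name'])

# Drop a row only when every column is missing
not_empty = df_demo.dropna(how='all')

# Keep rows that have at least 2 non-missing values
complete_enough = df_demo.dropna(thresh=2)
```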

In [None]:
# Fill missing values with a specific value ('Unknown' is an example placeholder)
df_filled = df.fillna('Unknown')
df_filled.isnull().sum()
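A single fill value rarely suits every column. One common option, sketched here on an illustrative DataFrame `df_demo`, is to pass a dictionary that maps each column to its own fill value, for example a placeholder string for text and the column mean for numbers:

```python
import pandas as pd
import numpy as np

df_demo = pd.DataFrame({
    'Name': ['Mukesh', 'Papon', np.nan],
    'Age': [25.0, np.nan, 23.0],
})

# Each column gets its own fill value; the mean ignores missing entries
filled = df_demo.fillna({'Name': 'Unknown', 'Age': df_demo['Age'].mean()})
print(filled)
```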

In [None]:
import pandas as pd
import numpy as np

# Creating a dictionary for the dataset
data = {
    'Name': ['Mukesh', 'Papon', 'Yono', 'Davendra', 'Cube', np.nan,'Mukesh'],
    'Age': [28, 30, 28, 22, 27, 23,28],
    'Gender': ['M', 'F', 'M', 'M', 'F','M','M'],
    'City': ['Gurugram', 'Pune', 'Kochi', 'Houston', np.nan, np.nan,'Gurugram']
}

# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)

# Work on a deep copy so the original DataFrame stays unchanged, then display it
df2 = df.copy(deep=True)
df2

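#### Removing Duplicates

`drop_duplicates()` removes rows that repeat an earlier row in every column and keeps the first occurrence. Here the last row repeats the first, so it is dropped.
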
In [None]:
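# .copy() makes the result independent of df2, so the later column
# assignment does not raise SettingWithCopyWarning on older pandas versions
df2_deduplicated = df2.drop_duplicates().copy()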
df2_deduplicated
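By default all columns are compared and the first occurrence is kept. The sketch below, on an illustrative DataFrame `df_demo`, shows `duplicated()` together with the `subset` and `keep` parameters of `drop_duplicates()`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Mukesh', 'Papon', 'Mukesh', 'Mukesh'],
    'City': ['Gurugram', 'Pune', 'Gurugram', 'Delhi'],
})

# duplicated() flags rows that repeat an earlier row
mask = df_demo.duplicated()

# Compare only 'Name' and keep the last occurrence of each name
last_per_name = df_demo.drop_duplicates(subset=['Name'], keep='last')

# keep=False drops every row that has a duplicate anywhere
unique_only = df_demo.drop_duplicates(keep=False)
```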

## Data Transformation

- Data transformation involves modifying, adding, or reformatting data within a DataFrame.
- Pandas offers a variety of methods for these tasks.

#### Changing Data Types

In [None]:
df2_deduplicated.info()

In [None]:
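# Cast 'Age' from int64 to object (generic Python objects) to show astype();
# in practice a numeric dtype such as 'float64' is usually the better choice
df2_deduplicated['Age'] = df2_deduplicated['Age'].astype(object)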
df2_deduplicated.info()
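A more typical conversion is turning text into numbers. The following sketch assumes a hypothetical messy column `raw` (not from the notebook) and uses `pd.to_numeric` with `errors='coerce'`, which replaces unparseable entries with `NaN` instead of raising:

```python
import pandas as pd

# Hypothetical messy input: numbers stored as text, plus one bad entry
raw = pd.Series(['25', '30', 'unknown', '22'], name='Age')

# errors='coerce' turns unparseable entries into NaN
ages = pd.to_numeric(raw, errors='coerce')

# The nullable 'Int64' dtype stores integers alongside missing values
ages_int = ages.astype('Int64')
print(ages_int)
```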

#### Applying Functions to Columns

In [None]:
data = df2_deduplicated.copy(deep=True)
data

In [None]:
# Applying a function to a column
data['Age'] = data['Age'].apply(lambda x: x + 1)
data
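For simple arithmetic, `apply` with a lambda is not required. As a sketch on an illustrative DataFrame `df_demo`, the same result comes from vectorized arithmetic, which operates on the whole column at once and is generally faster:

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Mukesh', 'Papon'], 'Age': [28, 30]})

# apply() calls the function once per element
via_apply = df_demo['Age'].apply(lambda x: x + 1)

# Vectorized arithmetic acts on the whole column at once
via_vectorized = df_demo['Age'] + 1

# String operations are available through the .str accessor
upper_names = df_demo['Name'].str.upper()
```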

#### Adding a New Column

In [None]:
# The list length must match the number of rows (6 after removing the duplicate)
data['Salary'] = [50000, 60000, 55000, 45000, 62000, 54000]
data

#### Grouping and Aggregation

In [None]:
# Group by City and compute the mean Salary per city;
# rows with a missing City are excluded by default
grouped = data.groupby('City')['Salary'].mean()
grouped
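To compute several statistics per group, named aggregation is one option. This is a sketch on an illustrative DataFrame `df_demo`; the output column names `mean_salary` and `headcount` are made up for the example. It also shows `dropna=False`, which keeps rows whose group key is missing.

```python
import pandas as pd
import numpy as np

df_demo = pd.DataFrame({
    'City': ['Gurugram', 'Pune', 'Gurugram', np.nan],
    'Salary': [50000, 60000, 54000, 62000],
})

# Named aggregation: one output column per (input column, function) pair
stats = df_demo.groupby('City').agg(
    mean_salary=('Salary', 'mean'),
    headcount=('Salary', 'count'),
)

# dropna=False keeps the group whose key is missing
with_missing = df_demo.groupby('City', dropna=False)['Salary'].mean()
```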

#### Sorting Data

In [None]:
# Sort by Age, oldest first; sort_values returns a new DataFrame
data_sorted = data.sort_values(by='Age', ascending=False)
data_sorted
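When ages tie, a second sort key makes the order predictable. The sketch below (with an illustrative `df_demo`) sorts by two columns in different directions and then renumbers the rows:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Yono', 'Mukesh', 'Papon', 'Davendra'],
    'Age': [28, 28, 30, 22],
})

# Oldest first; ties on Age are broken alphabetically by Name
ordered = df_demo.sort_values(by=['Age', 'Name'], ascending=[False, True])

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting
ordered = ordered.reset_index(drop=True)
print(ordered)
```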

## Data Visualization

- Data visualization is a powerful way to understand patterns and insights in your data.
- Pandas provides a `.plot()` method that uses Matplotlib as its default backend, and it also works well with visualization libraries such as Seaborn.

In [None]:
import pandas as pd
import numpy as np

# Creating a dictionary for the dataset
data = {
    'Name': ['Mukesh', 'Papon', 'Yono', 'Davendra', 'Cube', 'Monica','Mukesh'],
    'Age': [28, 30, 28, 22, 27, 23,28],
    'Gender': ['M', 'F', 'M', 'M', 'F','M','M'],
    'City': ['Gurugram', 'Pune', 'Kochi', 'Houston', 'Singa', 'Citt','Gurugram']
}

# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)
df

#### Simple Line Plot

In [None]:
import matplotlib.pyplot as plt

df.plot(x='Name', y='Age', kind='line')
plt.title('Age by Name')
plt.xlabel('Name')
plt.ylabel('Age')
plt.show()

#### Histogram

In [None]:
df['Age'].plot(kind='hist', bins=10)
plt.title('Age Histogram')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
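For a categorical column, a bar chart of counts is often more informative than a histogram. The following is a minimal sketch, assuming Matplotlib is installed and using an illustrative DataFrame `df_demo`:

```python
import pandas as pd
import matplotlib.pyplot as plt

df_demo = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'F', 'M', 'M']})

# value_counts() tallies how often each category occurs
counts = df_demo['Gender'].value_counts()

# Series.plot returns a Matplotlib Axes that can be customized further
ax = counts.plot(kind='bar')
ax.set_title('Rows per Gender')
ax.set_xlabel('Gender')
ax.set_ylabel('Count')
plt.show()
```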

## Merging DataFrames

When working with data, you often need to combine multiple datasets to perform a comprehensive analysis.
- Pandas provides powerful tools to merge, join, and concatenate DataFrames.

Concatenation combines DataFrames vertically (stacking rows) or horizontally (placing columns side by side).

**Concatenating DataFrames**
* Vertical Concatenation
* Horizontal Concatenation

#### Vertical Concatenation

Suppose you have two DataFrames `df1` and `df2` with the same columns:

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

print(df1)
print(df2)

You can concatenate them vertically using `pd.concat()`:

In [None]:
result = pd.concat([df1, df2])
print(result)
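Vertical concatenation keeps each DataFrame's original row labels, so the result can contain repeated index values. As a sketch (the names `top` and `bottom` are illustrative), `ignore_index=True` renumbers the rows:

```python
import pandas as pd

top = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
bottom = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# By default each frame keeps its own labels: 0, 1, 2, 0, 1, 2
stacked = pd.concat([top, bottom])

# ignore_index=True renumbers the rows 0..5
renumbered = pd.concat([top, bottom], ignore_index=True)
print(renumbered)
```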

#### Horizontal Concatenation

If you have two DataFrames with different columns and you want to concatenate them horizontally:

In [None]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})

print(df1)
print(df2)

You can use `pd.concat()` with the `axis=1` parameter:

In [None]:
result = pd.concat([df1, df2], axis=1)
print(result)
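Horizontal concatenation aligns rows by index label, not by position. The sketch below uses two illustrative frames, `left_part` and `right_part`, with deliberately shifted labels to show what happens when the indexes differ:

```python
import pandas as pd

left_part = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
right_part = pd.DataFrame({'C': [7, 8, 9]}, index=[1, 2, 3])  # shifted labels

# Default outer join on the index: NaN wherever a frame has no row for a label
outer = pd.concat([left_part, right_part], axis=1)

# join='inner' keeps only the labels present in both frames
inner = pd.concat([left_part, right_part], axis=1, join='inner')
print(outer)
```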

**Merging DataFrames**

Merging combines DataFrames based on a common column or key.
* Inner Merge

#### Inner Merge

Consider two DataFrames, `left` and `right`:

In [None]:
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'value': [4, 5, 6]})

print(left)
print(right)

You can perform an inner merge on the `key` column, which keeps only the keys present in both DataFrames:

In [None]:
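# how='inner' is the default; the overlapping 'value' columns
# are returned as value_x (from left) and value_y (from right)
merged = pd.merge(left, right, on='key')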
print(merged)
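Other join types follow the same pattern through the `how` parameter. As a hedged sketch reusing the `left` and `right` frames from above, `suffixes` renames the overlapping `value` columns and `indicator=True` records where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'value': [4, 5, 6]})

# suffixes replaces the default _x / _y on overlapping column names
inner = pd.merge(left, right, on='key', how='inner',
                 suffixes=('_left', '_right'))

# An outer merge keeps every key; indicator=True adds a '_merge' column
# with 'left_only', 'right_only', or 'both' for each row
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)
```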