# Data manipulation with Pandas

## DataFrame and Series

### Pandas Series

#### Creation

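You can create a Series from a list, a dictionary, or a scalar value. A list gets a default integer index, dictionary keys become the index labels, and a scalar is repeated once for every entry of the index you pass.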

In [1]:
import pandas as pd

# from a list
s1 = pd.Series([1, 3, 5, 7, 9])

# from a dictionary
s2 = pd.Series({'a': 1, 'b': 3, 'c': 5})

# from a scalar value
s3 = pd.Series(5, index=[0, 1, 2, 3])

print(s1)
print(s2)
print(s3)

0    1
1    3
2    5
3    7
4    9
dtype: int64
a    1
b    3
c    5
dtype: int64
0    5
1    5
2    5
3    5
dtype: int64
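
The constructors above differ mainly in where the index comes from. The following is a minimal sketch of passing an explicit index and a name; the variable names (temps, prices) and the values are made up for illustration and do not come from the notebook.

```python
import pandas as pd

# A list gets a default integer index unless you pass one explicitly.
temps = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"], name="temp_c")

# Dictionary keys become the index labels.
prices = pd.Series({"apple": 3, "pear": 5})

print(temps.index.tolist())   # ['mon', 'tue', 'wed']
print(temps["tue"])           # 23.0
print(prices.index.tolist())  # ['apple', 'pear']
```

Labels let you look values up by meaning ("tue") instead of by position.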


#### Operations

Arithmetic on Series is element-wise, as with NumPy arrays, but values are matched by index label rather than by position. A Series can also be indexed and sliced.

In [2]:
so1 = pd.Series([1,2,3,4,5])
so2 = pd.Series([10,20,30,40,50])

so3 = so1 + so2

# other things to try:
# so1[0]   # value at label 0
# so1[:3]  # first three values

so3[:3]

0    11
1    22
2    33
dtype: int64
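
One difference from NumPy is worth seeing in isolation: when two Series do not share the same index, pandas aligns them by label before operating. The sketch below uses two small, invented Series (left, right) to show this; it is an illustration, not part of the original notebook.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition matches values by label, not by position.
total = left + right
print(total)

# Labels present in only one Series give NaN; fill_value=0 treats the
# missing side as 0 instead.
filled = left.add(right, fill_value=0)
print(filled)
```

Here total is NaN for "a" and "d" because each label exists in only one operand, while filled keeps those values.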

### Pandas DataFrame

#### Creation

A DataFrame is a 2D labeled data structure with columns of potentially different types. You can create it from dictionaries, lists, or NumPy arrays.

In [9]:
data = {
    'Name': ['Ash', 'Bob', 'Charlie', 'Ash', 'Bob'],
    'Age': [25,30,35,40,45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}

df = pd.DataFrame(data)
print(df)

#print(pd.DataFrame(data))

      Name  Age         City
0      Ash   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3      Ash   40     New York
4      Bob   45  Los Angeles
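
The cell above builds the DataFrame from a dictionary of columns. As a sketch of the other two inputs mentioned, here is one way to build a DataFrame from a list of row dictionaries and from a NumPy array; the names rows, arr, df_rows and df_arr and the column names x and y are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary is one row.
rows = [
    {"Name": "Ash", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_rows = pd.DataFrame(rows)

# From a NumPy array: column names are supplied separately.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_arr = pd.DataFrame(arr, columns=["x", "y"])

print(df_rows)
print(df_arr.shape)  # (3, 2)
```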


#### Accessing Data

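Select a column with df['column'], a row by its index label with .loc, and a row by its integer position with .iloc. With the default integer index used here, labels and positions coincide.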

In [4]:
# accessing a column
age = df['Age']

# accessing rows by label
row_1 = df.loc[0]

# accessing rows by position
row_2 = df.iloc[1]

print(age,'\n')
print(row_1,'\n')
print(row_2,'\n')

0    25
1    30
2    35
3    40
4    45
Name: Age, dtype: int64 

Name         Ash
Age           25
City    New York
Name: 0, dtype: object 

Name            Bob
Age              30
City    Los Angeles
Name: 1, dtype: object 
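
Because df uses the default integer index, loc[0] and iloc[0] return the same row. The distinction becomes visible with a non-integer index. The sketch below assumes a small, invented frame (people) indexed by name.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "Los Angeles", "Chicago"]},
    index=["ash", "bob", "charlie"],
)

by_label = people.loc["bob"]                  # row whose index label is "bob"
by_position = people.iloc[1]                  # second row, whatever its label
one_cell = people.loc["charlie", "City"]      # row label, then column label
subset = people.loc[["ash", "bob"], ["Age"]]  # list selectors keep a DataFrame

print(by_label.equals(by_position))  # True
print(one_cell)                      # Chicago
print(subset.shape)                  # (2, 1)
```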



#### Adding/Removing Columns

Assigning to a new column name adds a column, and drop removes one.

In [7]:
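# work on a copy so that df keeps its 'City' column for the later examples
df2 = df.copy()

# adding a new column
df2['Salary'] = [50000, 60000, 70000, 80000, 90000]

# removing a column (drop returns a new DataFrame)
df2 = df2.drop(columns='City')

print(df2)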

      Name  Age  Salary
0      Ash   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3      Ash   40   80000
4      Bob   45   90000
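
Both operations can also be chained without touching the source frame. This is a minimal sketch using assign and drop on an invented frame (staff); the salary figures are placeholders.

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Ash", "Bob"],
    "Age": [25, 30],
    "City": ["New York", "Chicago"],
})

# assign() and drop() both return new DataFrames; staff is unchanged.
with_salary = staff.assign(Salary=[50000, 60000])
without_city = with_salary.drop(columns="City")

print(list(staff.columns))         # ['Name', 'Age', 'City']
print(list(without_city.columns))  # ['Name', 'Age', 'Salary']
```

Returning new objects keeps earlier cells reproducible when a notebook is re-run out of order.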


#### Filtering Data

Indexing a DataFrame with a boolean Series keeps only the rows where the condition is True.

In [None]:
# filtering rows where Age > 30
df_filtered = df[df['Age'] > 30]

print(df_filtered)
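
Conditions can be combined. The sketch below rebuilds the Name and Age columns in a separate frame (people, an illustrative name) and shows two common patterns: combining masks with & and testing membership with isin.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Ash", "Bob", "Charlie", "Ash", "Bob"],
    "Age": [25, 30, 35, 40, 45],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
over_30_bobs = people[(people["Age"] > 30) & (people["Name"] == "Bob")]

# isin() tests membership in a set of values.
ash_or_charlie = people[people["Name"].isin(["Ash", "Charlie"])]

print(over_30_bobs)
print(len(ash_or_charlie))  # 3
```

The parentheses are required because & binds more tightly than comparison operators in Python.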

#### GroupBy Operations

groupby splits the rows into groups according to the values of a key column; an aggregation such as mean is then computed once per group.

In [12]:
# average of the numeric columns within each city
grouped = df.groupby('City').mean(numeric_only=True)
print(grouped)

              Age
City             
Chicago      35.0
Los Angeles  37.5
New York     32.5
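
When one statistic is not enough, agg can compute several per group. The following sketch reuses the same data in a separate frame (people) and uses named aggregation; the output column names mean_age, oldest and n_people are chosen here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Ash", "Bob", "Charlie", "Ash", "Bob"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Los Angeles", "Chicago", "New York", "Los Angeles"],
})

# Named aggregation: each keyword becomes an output column,
# given as (source column, aggregation function).
summary = people.groupby("City").agg(
    mean_age=("Age", "mean"),
    oldest=("Age", "max"),
    n_people=("Name", "count"),
)
print(summary)
```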


#### Identifying data types of columns in DataFrame

The select_dtypes method returns the subset of columns whose dtype matches an include (or exclude) list, which makes it easy to find all columns of a given kind. Here are some common data types you might encounter:

Number: This includes both integers (int) and floating-point numbers (float).

Object: Typically used for strings, but can also include mixed types.

Datetime: Timestamps or date-related data (datetime64).

Boolean: True/False values (bool).

Category: Categorical data, which can be more memory efficient than strings.

Timedelta: Differences between two datetimes (timedelta64).

In [14]:
# Identify numeric columns
numeric_columns = df.select_dtypes(include=['number']).columns
print("Numeric columns:", numeric_columns)

# Identify object columns
object_columns = df.select_dtypes(include=['object']).columns
print("Object columns:", object_columns)

# Identify datetime columns
datetime_columns = df.select_dtypes(include=['datetime64']).columns
print("Datetime columns:", datetime_columns)

# Identify boolean columns
boolean_columns = df.select_dtypes(include=['bool']).columns
print("Boolean columns:", boolean_columns)

# Identify categorical columns
categorical_columns = df.select_dtypes(include=['category']).columns
print("Categorical columns:", categorical_columns)

# Identify timedelta columns
timedelta_columns = df.select_dtypes(include=['timedelta']).columns
print("Timedelta columns:", timedelta_columns)

Numeric columns: Index(['Age'], dtype='object')
Object columns: Index(['Name', 'City'], dtype='object')
Datetime columns: Index([], dtype='object')
Boolean columns: Index([], dtype='object')
Categorical columns: Index([], dtype='object')
Timedelta columns: Index([], dtype='object')
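
Most of the selections above come back empty because df only holds integer and string columns. As a sketch of how the other dtypes arise, the example below converts columns of an invented frame (events, with made-up dates) and then selects them; the frame and its column names are assumptions for illustration.

```python
import pandas as pd

events = pd.DataFrame({
    "city": ["New York", "Chicago", "New York"],
    "start": ["2024-01-01", "2024-01-03", "2024-01-10"],
    "attended": [True, False, True],
})

# Convert columns so they carry a more specific dtype than plain strings.
events["city"] = events["city"].astype("category")
events["start"] = pd.to_datetime(events["start"])
events["elapsed"] = events["start"] - events["start"].min()  # timedelta column

print(events.dtypes)

print(list(events.select_dtypes(include=["category"]).columns))    # ['city']
print(list(events.select_dtypes(include=["datetime64"]).columns))  # ['start']
print(list(events.select_dtypes(include=["timedelta"]).columns))   # ['elapsed']
print(list(events.select_dtypes(exclude=["bool"]).columns))        # ['city', 'start', 'elapsed']
```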
