**Series**

A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

In [2]:
import pandas as pd
# Creating a Series from a list
series1 = pd.Series([10, 20, 30, 40, 50])
print(series1)

# Creating a Series with custom index labels
series2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(series2)

# Accessing elements
print(series1[0])  # Output: 10
print(series2['b']) # Output: 20

0    10
1    20
2    30
3    40
4    50
dtype: int64
a    10
b    20
c    30
dtype: int64
10
20
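
The labels are what set a Series apart from a plain list. As a minimal sketch (the `stock` and `sold` values are made up for illustration), arithmetic between two Series matches elements by label rather than by position:

```python
import pandas as pd

# Hypothetical data: two Series that share only some labels
stock = pd.Series({'apples': 10, 'pears': 5, 'plums': 7})
sold = pd.Series({'pears': 2, 'apples': 4})

# Subtraction aligns on labels, so the differing order does not matter
remaining = stock - sold
print(remaining)

# A label present on only one side ('plums') yields NaN in the result
print(remaining.isna())
```

Because of the NaN, the result is stored as floats even though both inputs hold integers.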


**DataFrames:**

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types. It is similar to a database table or a spreadsheet.

In [3]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df)

# Creating a DataFrame from a list of dictionaries
data2 = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
         {'Name': 'Bob', 'Age': 30, 'City': 'London'},
         {'Name': 'Charlie', 'Age': 28, 'City': 'Paris'}]
df2 = pd.DataFrame(data2)
print(df2)

# Accessing columns
print(df['Name'])  # Output: a Series containing the names
print(df[['Name', 'Age']]) # Output: a DataFrame with 'Name' and 'Age' columns

# Accessing rows (using .loc or .iloc)
print(df.loc[0])    # Output: the first row (indexed by label)
print(df.iloc[0])   # Output: the first row (indexed by position)

# Adding a new column
df['Salary'] = [60000, 70000, 65000]
print(df)

# Deleting a column
df = df.drop('Salary', axis=1) # axis=1 means column
print(df)

# Deleting a row
df = df.drop(0, axis=0) # axis=0 means row
print(df)

# Renaming columns
df = df.rename(columns={'Name': 'First Name', 'Age': 'Years'})
print(df)

# Resetting the index
df = df.reset_index(drop=True) # drop=True avoids adding the old index as a new column
print(df)


      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   28
Name       Alice
Age           25
City    New York
Name: 0, dtype: object
Name       Alice
Age           25
City    New York
Name: 0, dtype: object
      Name  Age      City  Salary
0    Alice   25  New York   60000
1      Bob   30    London   70000
2  Charlie   28     Paris   65000
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   28     Paris
      Name  Age    City
1      Bob   30  London
2  Charlie   28   Paris
  First Name  Years    City
1        Bob     30  London
2    Charlie     28   Paris
  First Name  Years    City
0        Bob     30  London
1    Charlie     28   Paris
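
To make "columns of potentially different types" concrete, here is a small sketch that inspects the type of each column. The `Height` column and the name `df_types` are assumptions added for illustration:

```python
import pandas as pd

# Hypothetical frame with a text, an integer, and a float column
df_types = pd.DataFrame({'Name': ['Alice', 'Bob'],
                         'Age': [25, 30],
                         'Height': [1.65, 1.80]})

# Each column carries its own dtype
print(df_types.dtypes)

# shape is (number of rows, number of columns)
print(df_types.shape)
```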


**Reading and Writing Data:**

Pandas can read and write data from various file formats:

In [18]:
# Reading from a CSV file
df_csv = pd.read_csv('../Tutorials/data/data.csv')

# Writing to a CSV file
df.to_csv('output_file.csv', index=False) # index=False prevents writing the index

# Reading from an Excel file
#df_excel = pd.read_excel('your_file.xlsx') # You might need to install openpyxl: pip install openpyxl

# Writing to an Excel file
#df.to_excel('output_file.xlsx', index=False)
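
The calls above depend on files that exist on disk. As a self-contained sketch of the same round trip, the example below writes to and reads from an in-memory text buffer instead (`io.StringIO` stands in for a file; `df_out` and `df_in` are illustrative names):

```python
import io
import pandas as pd

df_out = pd.DataFrame({'Name': ['Bob', 'Charlie'], 'Years': [30, 28]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame
df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in)
```

Without `index=False`, the row index would be written as an extra unnamed column and read back as data.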


**Data Manipulation:**

*   **Filtering:**

In [7]:
# Get rows where 'Years' (the renamed 'Age' column) is greater than 25
filtered_df = df[df['Years'] > 25]
print(filtered_df)

  First Name  Years    City
0        Bob     30  London
1    Charlie     28   Paris
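
Filters can also combine several conditions. A minimal sketch, using a standalone frame with one made-up extra row ('Dana') so that the filters actually exclude something:

```python
import pandas as pd

# Standalone example data; the 'Dana' row is invented for illustration
people = pd.DataFrame({'First Name': ['Bob', 'Charlie', 'Dana'],
                       'Years': [30, 28, 41],
                       'City': ['London', 'Paris', 'London']})

# & means AND, | means OR, ~ means NOT; wrap each condition in parentheses
londoners_over_25 = people[(people['Years'] > 25) & (people['City'] == 'London')]
print(londoners_over_25)

# isin() keeps rows whose value appears in the given list
in_paris_or_rome = people[people['City'].isin(['Paris', 'Rome'])]
print(in_paris_or_rome)
```

The parentheses are required because `&` and `|` bind more tightly than comparison operators in Python.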


*   **Sorting:**

In [8]:
sorted_df = df.sort_values('Years', ascending=False)  # Sort by Age in descending order
print(sorted_df)

  First Name  Years    City
0        Bob     30  London
1    Charlie     28   Paris
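
Sorting can use more than one column. The sketch below (standalone data, with an invented 'Dana' row) sorts by city and then by descending age within each city:

```python
import pandas as pd

# Standalone example data; the 'Dana' row is invented for illustration
people = pd.DataFrame({'First Name': ['Bob', 'Charlie', 'Dana'],
                       'Years': [30, 28, 41],
                       'City': ['London', 'Paris', 'London']})

# One ascending flag per sort key: City A-Z, then Years high-to-low
by_city_then_age = people.sort_values(['City', 'Years'], ascending=[True, False])
print(by_city_then_age)

# Rows keep their original index labels after sorting;
# ignore_index=True renumbers them 0..n-1 instead
renumbered = people.sort_values('Years', ignore_index=True)
print(renumbered)
```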


*   **Grouping:**

In [9]:
grouped_df = df.groupby('City')['Years'].mean()  # Group by City and calculate the mean Age
print(grouped_df)


City
London    30.0
Paris     28.0
Name: Years, dtype: float64
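
With only one row per city, each group mean above is just that row's value. A sketch with invented data, where every city appears twice, shows the aggregation doing real work:

```python
import pandas as pd

# Invented data: two people per city
people = pd.DataFrame({'City': ['London', 'Paris', 'London', 'Paris'],
                       'Years': [30, 28, 40, 32]})

# One mean per city
mean_by_city = people.groupby('City')['Years'].mean()
print(mean_by_city)

# agg() computes several statistics per group in one call
summary = people.groupby('City')['Years'].agg(['count', 'mean', 'max'])
print(summary)
```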


*   **Applying functions:**

In [10]:
# Apply a function to each value in a column
df['Years_Doubled'] = df['Years'].apply(lambda x: x * 2)
print(df)


  First Name  Years    City  Years_Doubled
0        Bob     30  London             60
1    Charlie     28   Paris             56
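
For simple arithmetic, operating on the whole column gives the same result as `apply` and is usually faster; `apply` is more useful when the logic does not map onto a built-in operation. A sketch, where `age_band` is a hypothetical helper written for this example:

```python
import pandas as pd

df_apply = pd.DataFrame({'Years': [30, 28]})

# Element-by-element with apply versus arithmetic on the whole column
doubled_apply = df_apply['Years'].apply(lambda x: x * 2)
doubled_vectorized = df_apply['Years'] * 2
print(doubled_apply.equals(doubled_vectorized))


def age_band(years):
    """Hypothetical helper: label an age as under 30 or not."""
    return 'under 30' if years < 30 else '30 or over'


# apply suits custom per-value logic like this
df_apply['Band'] = df_apply['Years'].apply(age_band)
print(df_apply)
```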


**Missing Data:**

Pandas provides tools for handling missing data:

In [11]:

# Check for missing values
print(df.isnull())

# Fill missing values
df.fillna(0, inplace=True)  # Fill with 0
# or
df['Years'] = df['Years'].fillna(df['Years'].mean())  # Fill with mean of the 'Years' column

# Drop rows with missing values
df.dropna(inplace=True)

   First Name  Years   City  Years_Doubled
0       False  False  False          False
1       False  False  False          False

Note: avoid calling an inplace method on a selected column, as in `df['Years'].fillna(value, inplace=True)`. Recent pandas versions emit a FutureWarning for this pattern, and from pandas 3.0 it will not modify `df` at all, because the intermediate object `df['Years']` always behaves as a copy. Assign the result back with `df[col] = df[col].method(value)`, as above, or use `df.method({col: value}, inplace=True)`.
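
The DataFrame used above contains no missing values, so these calls have nothing to change. A minimal sketch with one deliberately missing age (invented data, using `numpy.nan` as the missing marker) shows each tool's effect:

```python
import numpy as np
import pandas as pd

# Invented data with one missing age
df_missing = pd.DataFrame({'First Name': ['Bob', 'Charlie', 'Dana'],
                           'Years': [30.0, np.nan, 40.0]})

# Count missing values per column
print(df_missing.isnull().sum())

# Fill the gap with the column mean; mean() skips NaN, so this is (30 + 40) / 2
filled = df_missing.copy()
filled['Years'] = filled['Years'].fillna(filled['Years'].mean())
print(filled)

# Alternatively, drop every row that has at least one missing value
dropped = df_missing.dropna()
print(dropped)
```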


**Basic Statistics:**

In [15]:

print(df.describe())  # Get descriptive statistics (mean, std, min, max, etc.)
print(df['Years'].mean())
print(df['Years'].median())
print(df['Years'].max())

           Years  Years_Doubled
count   2.000000       2.000000
mean   29.000000      58.000000
std     1.414214       2.828427
min    28.000000      56.000000
25%    28.500000      57.000000
50%    29.000000      58.000000
75%    29.500000      59.000000
max    30.000000      60.000000
29.0
29.0
30
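
The `std` row in the `describe()` output is the sample standard deviation (pandas divides by n - 1 by default), which is why two values one unit either side of the mean give 1.414214 rather than 1. A short sketch reproducing the numbers above:

```python
import pandas as pd

years = pd.Series([30, 28], name='Years')

# Default ddof=1: sample standard deviation, sqrt(2 / (2 - 1))
sample_std = years.std()
print(sample_std)

# ddof=0: population standard deviation, sqrt(2 / 2)
population_std = years.std(ddof=0)
print(population_std)

# The 25% row is a quantile, linearly interpolated between 28 and 30
print(years.quantile(0.25))
```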


This is just a starting point. Pandas is an extensive library with many more features, and you will discover more of them as you work with data. Practice is key: try working with some sample datasets to get comfortable with these concepts.

**The Key Difference: Labels vs. Positions**

*   **`df.loc`**: This is primarily label-based. You use it when you know the specific row and/or column *names* (labels) you want to access.
*   **`df.iloc`**: This is integer-based. You use it when you know the *numerical position* of the rows and/or columns you want.

Think of it this way:

*   `loc` is like looking up a word in a dictionary by its spelling (label).
*   `iloc` is like finding a word in a dictionary by its page number (position).
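
The distinction is easiest to see when the row labels are integers that do not match the positions. A minimal sketch, assuming a made-up frame `df_int` indexed by 10, 20, 30:

```python
import pandas as pd

# Hypothetical frame whose integer labels (10, 20, 30) differ from positions (0, 1, 2)
df_int = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[10, 20, 30])

print(df_int.loc[10, 'Name'])   # label 10 -> Alice
print(df_int.iloc[0, 0])        # position 0 -> Alice

# No row is labeled 0, so a label lookup fails even though position 0 exists
try:
    df_int.loc[0]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)     # True
```

With the default index 0, 1, 2, ... labels and positions coincide, which is why `df.loc[0]` and `df.iloc[0]` can return the same row.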

**Illustrative Example**

Let's use a simple DataFrame:

In [19]:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data, index=['person1', 'person2', 'person3'])  # Custom index labels
print(df)

            Name  Age      City
person1    Alice   25  New York
person2      Bob   30    London
person3  Charlie   28     Paris


This creates a DataFrame with custom row labels ('person1', 'person2', 'person3') and column labels taken from the dictionary keys ('Name', 'Age', 'City').

**Using `loc`**

In [21]:
# Select the row with label 'person2'
print(df.loc['person2'])

# Select the 'Age' column for the row with label 'person1'
print(df.loc['person1', 'Age'])

# Select multiple rows and columns using labels
print(df.loc[['person1', 'person3'], ['Name', 'City']])

# Slicing with labels (inclusive of the end label)
print(df.loc['person1':'person3', 'Name':'City'])

Name       Bob
Age         30
City    London
Name: person2, dtype: object
25
            Name      City
person1    Alice  New York
person3  Charlie     Paris
            Name  Age      City
person1    Alice   25  New York
person2      Bob   30    London
person3  Charlie   28     Paris


**Using `iloc`**


In [None]:
# Select the row at position 1 (second row)
print(df.iloc[1])

# Select the element at row position 0 and column position 1
print(df.iloc[0, 1])

# Select multiple rows and columns using positions
print(df.iloc[[0, 2], [0, 2]])

# Slicing with positions (exclusive of the end position)
print(df.iloc[0:3, 0:3])


**Important Notes:**

*   **Slicing:** When using `loc` with slices, the end label is *inclusive*. When using `iloc` with slices, the end position is *exclusive* (like regular Python slicing).
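*   **Mixed Indexing:** `loc` accepts only labels and `iloc` accepts only positions, so the two cannot be mixed in a single call. To combine a row label with a column position, convert one to the other first, for example `df.loc['person1', df.columns[1]]` or `df.iloc[df.index.get_loc('person1'), 1]`.
*   **Boolean Indexing:** Both `loc` and `iloc` can be used with boolean arrays for more complex filtering. `loc` accepts a boolean Series directly, while `iloc` needs a plain boolean array or list (for example `mask.to_numpy()`).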
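
A short sketch of these points, using the same three-person data under the illustrative name `df_notes`. The `get_loc` conversion at the end is one way to pair a row label with a column position:

```python
import pandas as pd

df_notes = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                         'Age': [25, 30, 28]},
                        index=['person1', 'person2', 'person3'])

# Label slice includes the end label: person1 and person2
by_label = df_notes.loc['person1':'person2']
print(len(by_label))       # 2

# Position slice excludes the end position: only position 0
by_position = df_notes.iloc[0:1]
print(len(by_position))    # 1

# Boolean selection: loc takes the Series, iloc takes a plain array
mask = df_notes['Age'] > 26
print(df_notes.loc[mask, 'Name'])
print(df_notes.iloc[mask.to_numpy(), 0])

# Row chosen by label, column chosen by position
row_pos = df_notes.index.get_loc('person2')
print(df_notes.iloc[row_pos, 1])   # 30
```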

**When to Use Which**

*   Use `loc` when you're working with data where you know the row and column labels, or when you want to use meaningful names to select data.
*   Use `iloc` when you only know the numerical positions of the rows and columns, or when you need to step through rows or columns by integer position (for example the first, last, or every nth row).
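
As a sketch of the positional case, the example below picks rows by where they sit in the frame, without needing to know any labels (`df_pos` is an illustrative name):

```python
import pandas as pd

df_pos = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                       'Age': [25, 30, 28]},
                      index=['person1', 'person2', 'person3'])

# Negative positions count from the end, as with Python lists
last_row = df_pos.iloc[-1]
print(last_row['Name'])    # Charlie

# The first two rows, whatever their labels are
first_two = df_pos.iloc[:2]
print(first_two)

# Step through rows by position, looking up the label alongside
for pos in range(len(df_pos)):
    print(pos, df_pos.index[pos], df_pos.iloc[pos]['Name'])
```

For plain row-by-row iteration, `df.itertuples()` is usually the more idiomatic choice; the loop here is only meant to show positional access.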

**In Summary**

`df.loc` and `df.iloc` are the main tools for data selection in Pandas. Understanding the difference between label-based (`loc`) and position-based (`iloc`) indexing is essential for selecting the rows and columns you actually intend.