Jupyter Notebook (.ipynb):
Jupyter Notebooks are interactive documents that let users combine code, text, images, and visualizations in a single environment. Jupyter supports many programming languages through kernels, and Python is one of the most popular choices for data analysis.
Advantages:
1. Interactive Execution: Jupyter Notebooks allow users to execute code cells interactively, which means you can run a specific code cell and see its output immediately, facilitating data exploration and analysis. This feature is particularly useful when exploring datasets or experimenting with algorithms.
2. Rich Media Integration: Jupyter Notebooks support Markdown cells, allowing users to add explanatory text, equations, images, and even interactive visualizations. This feature enhances the documentation and storytelling aspect of data analysis, making it easier to share findings and insights.
3. Data Visualization: With libraries like Matplotlib and Seaborn, Jupyter Notebooks render plots inline, directly below the cell that produced them. Keeping each figure next to its code makes results easier to present and review.
4. Code Modularity: Jupyter Notebooks allow users to break down code into smaller, manageable cells. This modularity aids in code organization and makes it easier to debug and maintain.
Disadvantages:
1. Version Control: Tracking changes in Jupyter Notebooks using version control systems like Git can be challenging. Notebooks are stored as JSON that includes cell outputs and execution counts alongside the code, so even a small code change can produce a large diff and lead to merge conflicts.
2. Execution Order: Cells can be run in any order, and the kernel keeps the state left behind by every cell that has run, so the results depend on the execution order rather than on the order of cells on the page. This can lead to confusion and errors when sharing Notebooks with others or revisiting them after some time.
3. Performance: Jupyter Notebooks are not ideal for long-running or resource-heavy computations. Large outputs and variables kept alive in the kernel consume memory, and a kernel crash or restart loses all in-memory state.
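The version control point can be illustrated with a minimal sketch. It assumes a simplified, notebook-like JSON structure built by a hypothetical helper, `make_notebook`; real .ipynb files carry more metadata. The sketch compares the serialized file before and after a one-character code change that was also re-executed.

```python
import difflib
import json


def make_notebook(source, execution_count, output_text):
    """Build a simplified notebook-like dict with one code cell (illustrative only)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


# Same cell before and after editing one character and re-running it.
before = json.dumps(make_notebook("print(1 + 1)", 3, "2\n"), indent=1).splitlines()
after = json.dumps(make_notebook("print(1 + 2)", 7, "3\n"), indent=1).splitlines()

# Keep only added/removed lines of the unified diff, skipping the file headers.
changed = [
    line
    for line in difflib.unified_diff(before, after, lineterm="", n=0)
    if line.startswith(("+", "-")) and not line.startswith(("---", "+++"))
]
print("\n".join(changed))
print("changed lines:", len(changed))
```

Besides the source line, the diff also touches the stored output and the execution count, which is why notebook diffs tend to be noisier than the underlying code change.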
Python File (.py):
Python files, denoted by the .py extension, contain Python code exclusively. They are standard script files and are often used for developing functions, classes, and larger projects.
Advantages:
1. Reproducibility: Python scripts promote a more structured and linear workflow, enhancing reproducibility. Code runs from top to bottom in the order it is written, making the flow of execution easier to understand.
2. Version Control: Python files integrate well with version control systems like Git, enabling efficient collaboration and tracking of changes.
3. Performance: Compared to Jupyter Notebooks, Python scripts are often a better fit for larger datasets and computationally intensive tasks. The interpreter is the same, but a script avoids notebook overhead such as rendering and storing outputs, and it is easier to profile, schedule, and run unattended.
4. Modularization: Python files encourage the creation of reusable functions and modules, facilitating code organization and maintainability.
Disadvantages:
1. Lack of Interactivity: Python scripts do not provide the same level of interactivity as Jupyter Notebooks. Data analysts might find it less convenient when performing exploratory data analysis or experimenting with code snippets.
2. Data Visualization: Python scripts can generate visualizations using libraries like Matplotlib and Seaborn, but the figures open in a separate window or must be saved to a file instead of appearing inline next to the code, as they do in Jupyter Notebooks.
3. Documentation and Storytelling: Although comments can be added to Python scripts, they do not provide the same rich media integration as Jupyter Notebooks, making it more challenging to create comprehensive and visually appealing analysis reports.
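The modularization and reproducibility points can be sketched as a small script. The file layout and the names `summarize` and `main` are hypothetical, chosen only for illustration.

```python
"""Hypothetical analysis script: reusable functions plus a single entry point."""
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def main():
    # Illustrative data; a real script would load a file here.
    ages = [20, 21, 19, 20, 22]
    print(summarize(ages))


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```

Because `summarize` lives in a plain module, other scripts or notebooks could import it, and the whole file always runs top to bottom in the same order.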




A Series is a one-dimensional labeled array capable of holding data of any type. It is similar to a NumPy array but has additional functionality and flexibility. A Series consists of a data array and an associated index, which allows for easy and efficient access to the data.

In [2]:
import pandas as pd

data = [1, 2, 3, 4, 5]
index = ['A', 'B', 'C', 'D', 'E']
series = pd.Series(data, index=index)

print(series)


A    1
B    2
C    3
D    4
E    5
dtype: int64
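As a brief sketch of how the index is used, the same Series can be accessed by label or by position, and arithmetic keeps the labels attached. The variable names below are illustrative.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])

by_label = series['C']            # label-based lookup
by_position = series.iloc[2]      # position-based lookup (third element)
subset = series.loc[['A', 'E']]   # select several labels at once
doubled = series * 2              # element-wise arithmetic keeps the index

print(by_label, by_position)
print(subset)
print(doubled)
```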


A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is like a spreadsheet or SQL table, where data is organized in rows and columns. Each column in a DataFrame is a Pandas Series.

In [3]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Joe', 'Nat', 'Harry', 'Sam', 'Monica'],
    'Age': [20, 21, 19, 20, 22],
    'Occupation': ['Manager', 'Teacher', 'Accountant', 'Therapist', 'Administrator']
}

df = pd.DataFrame(data)

print(df)


     Name  Age     Occupation
0     Joe   20        Manager
1     Nat   21        Teacher
2   Harry   19     Accountant
3     Sam   20      Therapist
4  Monica   22  Administrator
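A short sketch, reusing the same dictionary, shows that selecting one column returns a Series that shares the DataFrame's row index, and that each column keeps its own type.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Joe', 'Nat', 'Harry', 'Sam', 'Monica'],
    'Age': [20, 21, 19, 20, 22],
    'Occupation': ['Manager', 'Teacher', 'Accountant', 'Therapist', 'Administrator'],
})

ages = df['Age']                 # a single column is a pandas Series
print(type(ages).__name__)
print(df.dtypes)                 # one dtype per column
print("Mean age:", ages.mean())
```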


A Pandas DataFrame represents rectangular data, which means that the data is organized in a tabular format with rows and columns, where each row represents an observation or record, and each column represents a different attribute or variable. The rectangular structure implies that all rows have the same number of columns, forming a consistent grid-like arrangement.

On the other hand, non-rectangular data does not follow this strict tabular structure. It may have varying lengths for rows or differing numbers of columns across different observations. Non-rectangular data formats can include hierarchical structures, nested data, graph-based representations, or any other data organization that does not adhere to the standard row-column grid. Examples of non-rectangular data formats include JSON objects, XML files, and graph databases, among others.
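To make the contrast concrete, here is a minimal sketch that flattens nested, JSON-like records into a rectangular DataFrame with `pd.json_normalize`. The `orders` data and its field names are made up for illustration.

```python
import pandas as pd

# Non-rectangular: each customer has a different number of items.
orders = [
    {"customer": "Joe", "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
    {"customer": "Nat", "items": [{"sku": "C3", "qty": 5}]},
    {"customer": "Sam", "items": []},
]

# One row per item; the customer name is repeated on each of its rows.
flat = pd.json_normalize(orders, record_path="items", meta=["customer"])
print(flat)
```

Every row of the flattened result has the same columns, but the customer with an empty item list produces no rows, so flattening can lose information that the nested form carried.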

a. Data visualizations useful for data scientists to identify patterns and highlight important aspects of a dataset:

1. Histogram: Histograms are beneficial for data scientists to understand the distribution of a continuous variable. For example, plotting a histogram of customer ages in a marketing dataset can reveal age groups with the highest concentration of customers, allowing data scientists to target specific age demographics for marketing campaigns.

2. Scatterplot: Scatterplots help data scientists identify relationships and correlations between two continuous variables. For instance, plotting the relationship between advertising expenditure and sales can show if there is a positive or negative correlation, aiding data scientists in optimizing advertising strategies to increase sales.

3. Multiline Plot: Multiline plots are valuable for visualizing trends and patterns over time or across different categories. For instance, plotting the monthly sales of different products on a multiline chart can help data scientists identify the products with consistent growth or seasonal variations, enabling better inventory management and marketing decisions.

4. Bar Graph: Bar graphs are useful for comparing categorical data. Data scientists can use bar graphs to highlight significant differences between groups. For example, comparing the sales performance of different product categories using a bar graph can reveal which categories are the most profitable or have the highest demand.
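The four chart types above can be sketched with Matplotlib as follows. All data here is synthetic and the variable names are illustrative; the snippet assumes Matplotlib and NumPy are installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for a marketing dataset.
ages = rng.normal(35, 10, 500)
ad_spend = rng.uniform(1, 10, 100)
sales = 3 * ad_spend + rng.normal(0, 2, 100)
months = np.arange(1, 13)
product_sales = {
    "Product A": 100 + 5 * months,
    "Product B": 120 + 20 * np.sin(2 * np.pi * months / 12),
}
category_totals = {"Toys": 40, "Books": 65, "Games": 55}

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].hist(ages, bins=20)                      # distribution of one variable
axes[0, 0].set_title("Histogram: customer ages")

axes[0, 1].scatter(ad_spend, sales)                 # relationship between two variables
axes[0, 1].set_title("Scatterplot: ad spend vs. sales")

for name, values in product_sales.items():          # trends over time per category
    axes[1, 0].plot(months, values, label=name)
axes[1, 0].set_title("Multiline plot: monthly sales")
axes[1, 0].legend()

axes[1, 1].bar(list(category_totals), list(category_totals.values()))  # category comparison
axes[1, 1].set_title("Bar graph: sales by category")

fig.tight_layout()
# fig.savefig("chart_types.png")  # or call plt.show() in an interactive session
```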

b. Data visualizations suitable for telling a story and creating business presentations:

1. Storytelling with Histogram: Using a sequence of histograms, data scientists can present the evolution of a variable over time or across different segments. For example, showing the change in customer satisfaction scores over the years in various regions can provide insights into the effectiveness of marketing campaigns and customer service efforts.

2. Interactive Scatterplot: Interactive scatterplots can be used in business presentations to engage the audience and explore relationships between variables interactively. For instance, a presenter can show how different marketing strategies affect customer loyalty by letting the audience interact with the scatterplot and observe the effects in real time.

3. Multiline Plot with Annotations: A multiline plot with annotated points can be used to highlight significant events or milestones in a business's journey. For example, visualizing the revenue growth of a startup and adding annotations for product launches or major funding rounds can create a compelling narrative for potential investors.

4. Bar Graph Infographics: Creating infographics with bar graphs can effectively communicate key performance indicators and business metrics to stakeholders. For instance, presenting market share data of different competitors using colorful bar graphs in an infographic can make the information easily understandable and visually appealing.



In [4]:
import pandas as pd
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')

In [10]:
print("Number of rows:", titanic.shape[0])
print("Number of columns:", titanic.shape[1])


Number of rows: 891
Number of columns: 15


In [13]:
print(titanic.describe())

print(titanic.count())

print("Mean Age:", titanic['age'].mean())

print("Min Age:", titanic['age'].min())
print("Max Age:", titanic['age'].max())


         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
survived       891
pclass         891
sex            891
age            714
sibsp          891
parch          891
fare           891
embarked       889
class          891
who            891
adult_male     891
deck           203
embark_town    889
alive          891
alone          891
dtype: int64


In [18]:
missing_values = titanic.isnull().sum()

print("Number of missing values for each column:")
print(missing_values)

Number of missing values for each column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
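Raw counts are easier to judge as a share of all rows: with 891 rows, the counts above mean roughly 77% of `deck` and roughly 20% of `age` are missing. A minimal sketch of that calculation, using a small made-up DataFrame (`sample`) so it runs without downloading the dataset:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic data; values are illustrative only.
sample = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# isnull() gives booleans; their mean is the fraction of missing values per column.
missing_share = sample.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `titanic` would rank its columns by percentage of missing values.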


In [23]:
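# Bin ages into 10-year groups. With right=False the last bin is [70, 80), so an age of
# exactly 80 (the dataset maximum) falls outside every bin and gets NaN as its age_group.
titanic_cleaned = titanic.dropna(subset=['age']).copy()
titanic_cleaned['age_group'] = pd.cut(titanic_cleaned['age'], bins=range(0, 81, 10), right=False)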
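The effect of `right=False` on the bin edges can be checked on a few hand-picked ages. This is a small sketch; extending the bins to 90 is one possible way to include an age of exactly 80, not necessarily the intended one.

```python
import pandas as pd

ages = pd.Series([0.42, 9.0, 10.0, 79.0, 80.0])

# Left-closed bins [0, 10), ..., [70, 80): the value 80.0 is not covered.
half_open = pd.cut(ages, bins=range(0, 81, 10), right=False)

# Adding one more bin, [80, 90), covers it.
extended = pd.cut(ages, bins=range(0, 91, 10), right=False)

print(pd.DataFrame({"age": ages, "bins_to_80": half_open, "bins_to_90": extended}))
```

Rows whose group is NaN are silently skipped by `groupby`, so any per-group counts built on the first binning would leave out a passenger aged exactly 80.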


In [24]:
# Group by 'age_group' and 'sex', then count passengers in each category.
# observed=False keeps empty age groups in the result with a count of 0.
age_sex_counts = titanic_cleaned.groupby(['age_group', 'sex'], observed=False).size().reset_index(name='count')

# Group by 'age_group' and 'class', then count passengers in each category
age_class_counts = titanic_cleaned.groupby(['age_group', 'class'], observed=False).size().reset_index(name='count')

print(age_sex_counts)
print(age_class_counts)


   age_group     sex  count
0    [0, 10)  female     30
1    [0, 10)    male     32
2   [10, 20)  female     45
3   [10, 20)    male     57
4   [20, 30)  female     72
5   [20, 30)    male    148
6   [30, 40)  female     60
7   [30, 40)    male    107
8   [40, 50)  female     32
9   [40, 50)    male     57
10  [50, 60)  female     18
11  [50, 60)    male     30
12  [60, 70)  female      4
13  [60, 70)    male     15
14  [70, 80)  female      0
15  [70, 80)    male      6
   age_group   class  count
0    [0, 10)   First      3
1    [0, 10)  Second     17
2    [0, 10)   Third     42
3   [10, 20)   First     18
4   [10, 20)  Second     18
5   [10, 20)   Third     66
6   [20, 30)   First     34
7   [20, 30)  Second     53
8   [20, 30)   Third    133
9   [30, 40)   First     50
10  [30, 40)  Second     48
11  [30, 40)   Third     69
12  [40, 50)   First     37
13  [40, 50)  Second     18
14  [40, 50)   Third     34
15  [50, 60)   First     27
16  [50, 60)  Second     15
17  [50, 60)   Third