This project analyzes datasets from Kaggle and other sources, both to gain insight into the data and to hone my R coding skills. It is updated continuously as new and interesting datasets are added and analyzed in R.
Dataset: Netflix movies and TV shows. This dataset lists the movies and TV shows that are or were previously available on Netflix, with details such as cast, directors, ratings, release year, and duration.
Code: Netflix Content.rmd
Goal:
- Visualize the number of movies and TV shows available on Netflix.
- Visualize the number of movies and TV shows for each rating.
- Visualize the number of movies produced by each country using a world map plot in ggplot2.
Skill: Data Cleaning, Univariate Analysis, Bivariate Analysis, Descriptive Statistics.
Requirements: The following R packages and versions:
- tidyverse version: 1.3.2
- skimr version: 2.1.5
- ggplot2 version: 3.4.0
Code Output: Netflix Content output.pdf
Result: Netflix has twice as many movies as TV shows. Most titles are produced in the United States, followed by India.
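The first two goals above can be sketched as follows. This is a minimal illustration, not the code in Netflix Content.rmd: the toy data frame and the column names `type` and `rating` are assumptions standing in for the real dataset.

```r
library(ggplot2)

# Toy stand-in for the Netflix data (hypothetical rows; column names assumed).
netflix <- data.frame(
  type   = c("Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"),
  rating = c("PG-13", "R", "TV-MA", "PG-13", "TV-14", "TV-MA")
)

# Count titles per type (Movie vs. TV Show).
type_counts <- as.data.frame(table(type = netflix$type), responseName = "n")

# Bar chart of the number of movies and TV shows.
p_type <- ggplot(type_counts, aes(x = type, y = n)) +
  geom_col() +
  labs(x = "Type", y = "Number of titles")

# Bar chart of titles per rating, split by type.
p_rating <- ggplot(netflix, aes(x = rating, fill = type)) +
  geom_bar(position = "dodge") +
  labs(x = "Rating", y = "Number of titles", fill = "Type")
```

Printing `p_type` or `p_rating` renders the corresponding chart. With the real data, `netflix` would be read from the downloaded CSV instead of being built by hand.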
Dataset: Insurance Cost. This dataset comprises 1,338 rows and 7 variables: age, sex, BMI, number of children, smoker status, region, and insurance charges. The target variable is the cost of insurance claims (charges).
Code: insurance-cost-regression.rmd.
Goal:
- Conduct Exploratory Data Analysis on Insurance Dataset.
- Forecast insurance costs.
Skill: Data Cleaning, Univariate Analysis, Bivariate Analysis, Descriptive Statistics, Multiple Linear Regression, Box-Cox Transformation, Random Forest.
Requirement: The following R packages and versions:
- tidyverse version: 1.3.2
- skimr version: 2.1.5
- ggplot2 version: 3.4.0
- car version: 3.1-1
- MASS version: 7.3-58.2
- GGally version: 2.1.2
- randomForest version: 4.7-1.1
- caret version: 6.0-93
Code Output: Insurance cost regression output.pdf
Result: Smoker status is the most significant predictor of insurance charges, followed by body mass index (BMI) and age. The number of children, region, and sex have minor or no impact on the charges.
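As a rough illustration of the multiple linear regression and Box-Cox steps listed under Skill, here is a minimal sketch. It is not the code in insurance-cost-regression.rmd: the data are simulated, only three of the predictors are included, and the coefficients used to generate `charges` are arbitrary assumptions.

```r
library(MASS)  # provides boxcox()

set.seed(42)
n <- 200

# Simulated stand-in for the insurance data (values are made up).
insurance <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  bmi    = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
# Right-skewed positive response, as insurance charges typically are.
insurance$charges <- exp(
  7 + 0.03 * insurance$age + 0.01 * insurance$bmi +
    1.5 * (insurance$smoker == "yes") + rnorm(n, sd = 0.3)
)

# Multiple linear regression on the untransformed response.
fit_raw <- lm(charges ~ age + bmi + smoker, data = insurance)

# Profile the Box-Cox log-likelihood over a grid and pick the best lambda.
bc <- boxcox(fit_raw, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transform (log when lambda is effectively zero).
insurance$charges_bc <- if (abs(lambda) < 1e-8) {
  log(insurance$charges)
} else {
  (insurance$charges^lambda - 1) / lambda
}

# Refit on the transformed response and inspect the coefficients.
fit_bc <- lm(charges_bc ~ age + bmi + smoker, data = insurance)
coefs <- summary(fit_bc)$coefficients
```

Comparing residual plots of `fit_raw` and `fit_bc` is one way to judge whether the transformation helped. Predictions from `fit_bc` are on the transformed scale and need to be back-transformed before being read as charges.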
Dataset: House Prices
Code: housing-prediction.rmd.
Goal:
- Predict housing prices.
Skill: Regularized linear regression (ridge, lasso, and elastic net).
Requirement: The following R packages and versions:
- tidyverse version: 1.3.2
- glmnet
Code Output: Housing Price output.pdf
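The three regression variants named under Skill differ only in the `alpha` mixing parameter of glmnet. The sketch below shows that relationship on simulated data. It is an assumption-laden illustration rather than the code in housing-prediction.rmd: the feature matrix, the feature names, and the response are all made up.

```r
library(glmnet)

set.seed(1)
n <- 100
p <- 10

# Simulated predictors (hypothetical feature names).
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("feature_", seq_len(p))

# Only the first two features carry signal in this toy response.
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)

# alpha = 0 is ridge, alpha = 1 is lasso, values in between are elastic net.
alphas <- c(ridge = 0, elastic_net = 0.5, lasso = 1)

# Cross-validate the penalty strength lambda for each variant.
fits <- lapply(alphas, function(a) cv.glmnet(x, y, alpha = a))

# Lowest cross-validated mean squared error for each variant.
cv_mse <- sapply(fits, function(f) min(f$cvm))

# Non-zero coefficients (intercept excluded) at the more conservative lambda.1se.
nonzero <- sapply(fits, function(f) {
  sum(as.matrix(coef(f, s = "lambda.1se"))[-1, 1] != 0)
})
```

Ridge shrinks coefficients without setting any to zero, while lasso and elastic net can drop predictors entirely, which `nonzero` makes visible. `cv_mse` gives a quick way to compare the three fits.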
Acknowledgements: