
# **Experiment - 1: Getting introduced to data analytics libraries in Python and R**





Lab Objectives: To effectively use libraries for data analytics.




Lab Outcomes (LO): Explore various data analytics Libraries in R and Python. (LO1)

## Data Analytics Libraries in R

### tidyr
The goal of tidyr is to help you create tidy data. Tidy data is data where:

>1)Every column is a variable.

>2)Every row is an observation.

>3)Every cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in `vignette("tidy-data")`.





In [None]:
install.packages("tidyverse")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(tidyr)

# Build a 10-row data frame with one column per group (a wide layout).
n <- 10
tidy_dataframe <- data.frame(
  S.No    = 1:n,
  Group.1 = c(23, 345, 76, 212, 88, 199, 72, 35, 90, 265),
  Group.2 = c(117, 89, 66, 334, 90, 101, 178, 233, 45, 200),
  Group.3 = c(29, 101, 239, 289, 176, 320, 89, 109, 199, 56)
)

# Display the data frame.
tidy_dataframe

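| S.No `<int>` | Group.1 `<dbl>` | Group.2 `<dbl>` | Group.3 `<dbl>` |
|---|---|---|---|
| 1 | 23 | 117 | 29 |
| 2 | 345 | 89 | 101 |
| 3 | 76 | 66 | 239 |
| 4 | 212 | 334 | 289 |
| 5 | 88 | 90 | 176 |
| 6 | 199 | 101 | 320 |
| 7 | 72 | 178 | 89 |
| 8 | 35 | 233 | 109 |
| 9 | 90 | 45 | 199 |
| 10 | 265 | 200 | 56 |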


### dplyr
**What is dplyr?**

> dplyr is a powerful R package for manipulating, cleaning, and summarizing tabular data. In short, it makes data exploration and data manipulation in R easy and fast.

**What's special about dplyr?**

> dplyr provides functions for the most common data manipulation operations: filtering rows, selecting specific columns, sorting data, adding or deleting columns, and aggregating data. Its functions are also easy to learn and to remember because each is named after what it does. For example, filter() is used to filter rows.



In [None]:
library(dplyr)
library(tidyr)

In [None]:

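# Sample data: `ht` (height) is missing for two of the four children.
d <- data.frame(
  name   = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age    = c(7, 5, 9, 16),
  ht     = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)
print(d)

# Keep only the rows where `ht` is missing.
rows_with_na <- d %>% filter(is.na(ht))
print(rows_with_na)

# Keep only the rows where `ht` is present.
rows_without_na <- d %>% filter(!is.na(ht))
print(rows_without_na)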


     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
     name age ht school
1 Bhavesh   5 NA    yes
2  Chaman   9 NA     no
   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no


### readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results. If you are new to readr, the best place to start is the data import chapter in R for Data Science.

The easiest way to get readr is to install the whole tidyverse:

>install.packages("tidyverse")

Alternatively, install just readr:

>install.packages("readr")

### stringr

There are four main families of functions in stringr:

1. Character manipulation: these functions allow you to manipulate individual characters within the strings in character vectors.
2. Whitespace tools to add, remove, and manipulate whitespace.
3. Locale-sensitive operations, whose behaviour varies from locale to locale.
4. Pattern matching functions. These recognise four engines of pattern description. The most common is regular expressions; the other three are fixed byte matching (`fixed()`), locale-aware character matching (`coll()`), and text boundary analysis (`boundary()`).

**Pattern matching**

The vast majority of stringr functions work with patterns. These are parameterised by the task they perform and the types of patterns they match.


### jsonlite
The jsonlite package is a JSON parser/generator optimized for the web. Its main strength is that it implements a bidirectional mapping between JSON data and the most important R data types. Thereby we can convert between R objects and JSON without loss of type or information, and without the need for any manual data munging. This is ideal for interacting with web APIs, or to build pipelines where data structures seamlessly flow in and out of R using JSON.

>**Simplification**

Simplification is the process where JSON arrays automatically get converted from a list into a more specific R class. The fromJSON function has 3 arguments which control the simplification process: simplifyVector, simplifyDataFrame and simplifyMatrix. Each one is enabled by default.

>**Atomic Vectors**

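When simplifyVector is enabled, JSON arrays containing primitives (strings, numbers, booleans or null) simplify into an atomic vector. For example, `fromJSON("[1, 2, 3]")` returns a three-element vector equivalent to `c(1, 2, 3)` rather than a list.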

## Data Analytics Libraries in Python

### TensorFlow

TensorFlow makes it easy for beginners and experts to create machine learning models for desktop, mobile, web, and cloud.

Data can be the most important factor in the success of your ML endeavors. TensorFlow offers multiple data tools to help you consolidate, clean and preprocess data at scale:

1. Standard datasets for initial training and validation

2. Highly scalable data pipelines for loading data

3. Preprocessing layers for common input transformations

4. Tools to validate and transform large datasets
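A minimal sketch of the second and third items in the list above (a data pipeline and a preprocessing layer), assuming TensorFlow 2.x is installed. The `features` and `labels` arrays are made-up illustration data, not part of the experiment.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 6 samples with 2 features each.
features = np.arange(12, dtype="float32").reshape(6, 2)
labels = np.array([0, 1, 0, 1, 0, 1], dtype="int32")

# Data pipeline: slice the arrays into samples, shuffle them, group into batches of 2.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=6, seed=0)
    .batch(2)
)

# Preprocessing layer: learn per-feature mean and variance, then standardise inputs.
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(features)

# Apply the layer to each batch coming out of the pipeline.
batch_shapes = []
for x_batch, y_batch in dataset:
    scaled = normalizer(x_batch)
    batch_shapes.append(scaled.numpy().shape)

# Standardising the full data set gives each feature a mean close to zero.
scaled_all = normalizer(features).numpy()
print(batch_shapes)
print(scaled_all.mean(axis=0))
```

In a real model the `Normalization` layer would normally be the first layer of the network, so the same scaling is applied during training and inference.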

Additionally, responsible AI tools help you uncover and eliminate bias in your data to produce fair, ethical outcomes from your models.

### NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

**POWERFUL N-DIMENSIONAL ARRAYS**

Fast and versatile, the NumPy vectorization, indexing, and broadcasting concepts are the de-facto standards of array computing today.
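The three concepts named above can be illustrated with a short sketch. The array names and values are made up for the example.

```python
import numpy as np

prices = np.array([[10.0, 20.0, 30.0],
                   [40.0, 50.0, 60.0]])   # shape (2, 3)
factors = np.array([0.5, 1.0, 2.0])       # shape (3,)

# Vectorization: one expression operates on every element, with no Python loop.
doubled = prices * 2

# Broadcasting: the (3,) vector is applied to each row of the (2, 3) array.
scaled = prices * factors

# Indexing: a boolean mask selects the elements that satisfy a condition.
large = prices[prices > 25]

print(doubled)
print(scaled)
print(large)
```

Here `scaled` has the same shape as `prices`, because NumPy stretches `factors` across both rows instead of requiring an explicit copy.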

**NUMERICAL COMPUTING TOOLS**

NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more.
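As a rough illustration of these tools, the sketch below solves a small linear system, draws random samples, and finds the dominant frequency of a signal. All values are invented for the example.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Random number generation: a seeded generator gives reproducible samples.
rng = np.random.default_rng(seed=42)
samples = rng.normal(size=1000)

# Fourier transform: a 5-cycle sine wave sampled at 64 points peaks at bin 5.
t = np.arange(64) / 64
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))

print(x)
print(samples.mean())
print(peak_bin)
```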

**OPEN SOURCE**

Distributed under a liberal BSD license, NumPy is developed and maintained publicly on GitHub by a vibrant, responsive, and diverse community.

In [None]:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
c = a + b
print(c)


[[ 6  8]
 [10 12]]


### SciPy
**FUNDAMENTAL ALGORITHMS**

SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.
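A brief sketch of three of these problem classes (integration, optimization, and solving an algebraic equation), assuming SciPy is installed. The functions being integrated and minimised are arbitrary examples.

```python
import numpy as np
from scipy import integrate, optimize

# Integration: area under sin(x) between 0 and pi (the exact value is 2).
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimise a quadratic whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

# Algebraic equation: find the root of x**2 - 2 on the interval [0, 2].
root = optimize.brentq(lambda x: x ** 2 - 2.0, 0.0, 2.0)

print(area)
print(result.x)
print(root)
```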

**BROADLY APPLICABLE**

The algorithms and data structures provided by SciPy are broadly applicable across domains.

**FOUNDATIONAL**

SciPy extends NumPy, providing additional tools for array computing and specialized data structures such as sparse matrices and k-dimensional trees.
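The two data structures mentioned here can be sketched as follows, assuming SciPy is installed. The matrix and the points are made-up illustration data.

```python
import numpy as np
from scipy import sparse
from scipy.spatial import KDTree

# Sparse matrix: only the non-zero entries are stored.
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
matrix = sparse.csr_matrix(dense)
stored_values = matrix.nnz

# k-dimensional tree: fast nearest-neighbour lookup among a set of points.
points = np.array([[0.0, 0.0],
                   [1.0, 1.0],
                   [5.0, 5.0]])
tree = KDTree(points)
distance, index = tree.query([0.9, 1.2])

print(stored_values)
print(index, distance)
```

The query returns the distance to the closest stored point together with its position in `points`.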

**PERFORMANT**

SciPy wraps highly-optimized implementations written in low-level languages like Fortran, C, and C++. Enjoy the flexibility of Python with the speed of compiled code.

### Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

Pandas is well suited to many different kinds of data:

1. Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.

2. Ordered and unordered (not necessarily fixed-frequency) time series data.

3. Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

4. Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.
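The first two kinds of data in the list above can be sketched as follows. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Tabular data: each column can hold a different type.
table = pd.DataFrame({
    "city": ["Pune", "Delhi"],      # text
    "temp_c": [31.5, 28.0],         # floating point
    "visits": [3, 7],               # integer
})
print(table.dtypes)

# Time series: the index holds timestamps instead of integer positions.
index = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index)

# Resample the daily readings into 3-day buckets and average each bucket.
three_day_mean = readings.resample("3D").mean()
print(three_day_mean)
```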

In [None]:
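import pandas as pd

# Build a small DataFrame from a dict of columns.
data = {'name': ['Alice', 'Bob', 'Carol', 'Dave'],
        'age': [20, 25, 30, 35]}
df = pd.DataFrame(data)

# Whole table.
print(df)

# A single column (returned as a Series).
print(df['age'])

# One row, then several rows, selected by index label.
print(df.loc[0])
print(df.loc[[0, 2]])

# A subset of columns.
print(df[['name', 'age']])

# Rows matching a condition.
print(df[df['age'] > 25])

# Rows sorted by a column.
print(df.sort_values('age'))

# groupby() is lazy: printing it shows only the object, not grouped data.
# Call an aggregation such as .size() or .mean() on it to get results.
print(df.groupby('age'))

# Mean of a numeric column.
print(df['age'].mean())

# Write the table to a CSV file (the index is saved as the first column).
df.to_csv('data.csv')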


    name  age
0  Alice   20
1    Bob   25
2  Carol   30
3   Dave   35
0    20
1    25
2    30
3    35
Name: age, dtype: int64
name    Alice
age        20
Name: 0, dtype: object
    name  age
0  Alice   20
2  Carol   30
    name  age
0  Alice   20
1    Bob   25
2  Carol   30
3   Dave   35
    name  age
2  Carol   30
3   Dave   35
    name  age
0  Alice   20
1    Bob   25
2  Carol   30
3   Dave   35
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7868166ceb60>
27.5
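The `print(df.groupby('age'))` call in the cell above shows only a `DataFrameGroupBy` object, because grouping does nothing visible until an aggregation is applied. A minimal sketch of a complete group-and-aggregate step is given below. The `team` column is a hypothetical addition, since the original table has no repeated values to group on.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "team": ["red", "blue", "red", "blue"],   # hypothetical grouping column
    "age": [20, 25, 30, 35],
})

# Split the rows by team, then compute the mean age within each team.
mean_age = df.groupby("team")["age"].mean()
print(mean_age)
```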


### Scrapy
Scrapy is an open-source Python framework whose goal is to make web scraping easier. You can build robust and scalable spiders with its comprehensive set of built-in features.

A scraper assembled from stand-alone libraries typically combines Requests for HTTP requests, BeautifulSoup for parsing, and Selenium for JavaScript-based sites. Scrapy covers the same ground in a single framework.
It includes:

>HTTP connections.

>Support for CSS Selectors and XPath expressions.

>Data export to CSV, JSON, JSON lines, and XML.

>Ability to store data on FTP, S3, and local file system.

>Middleware support for integrations.

>Cookie and session management.

>JavaScript rendering with Splash.

>Support for automated retries.

>Concurrency management.

>Built-in crawling capabilities.

Additionally, its active community has created extensions to further enhance its capabilities, allowing developers to tailor the tool to meet their specific scraping requirements.

Conclusion: We have explored data analytics libraries in Python and R.