# Data acquisition

## Extracting links from Hyperlinks (not needed)

The Excel file stores the URLs of the web pages as hyperlinks, which pandas cannot read directly. Hence, we use Excel VBA to create a module containing a function that extracts the hyperlink address from the cell passed as its argument.

Code:

    Public Function GetURL(c As Range) As String
        On Error Resume Next
        GetURL = c.Hyperlinks(1).Address
    End Function

Once the module is created, GetURL can be used like any other worksheet function. For example, if cell A1 contains a hyperlink, the following formula returns its address:

=GetURL(A1)
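
The same extraction can probably be done from Python without VBA. The sketch below assumes the `openpyxl` package is available and that the links live in a single worksheet column; `extract_hyperlinks` is a hypothetical helper name, and the workbook here is a small in-memory stand-in rather than the real file.

```python
from openpyxl import Workbook

def extract_hyperlinks(ws, column=1, first_row=1):
    """Return the hyperlink target (or None) for each cell in one column."""
    urls = []
    for (cell,) in ws.iter_rows(min_row=first_row, min_col=column, max_col=column):
        # cell.hyperlink is None when no link is attached to the cell
        link = cell.hyperlink
        urls.append(link.target if link is not None else None)
    return urls

# Tiny in-memory demonstration (not the real workbook)
wb = Workbook()
ws = wb.active
ws["A1"] = "Some headline"
ws["A1"].hyperlink = "https://example.com/article"
ws["A2"] = "Plain text, no link"

demo_urls = extract_hyperlinks(ws)
print(demo_urls)
```

For the real file, the worksheet would come from `openpyxl.load_workbook(...)`, and the column index would have to match wherever the hyperlinked titles actually sit.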

## Web Scraping (using pandas and requests) (not needed)

In [1]:
import requests
import pandas as pd

# Define the Excel file path
excel_file = "BPCLIPEX News Links (1).xlsx"

# Read all sheets into a dictionary of DataFrames
dfs = pd.read_excel(excel_file, sheet_name=['August23','September23','October 23','November 23','December 23','Dec23 Shortlist','January24','February2024','March2024'])

# List to store the content of each article across all sheets
article_contents = []
successful_rows = []

# Iterate over each sheet and each URL in the DataFrame
for sheet_name, df in dfs.items():
    print(f"Reading sheet: {sheet_name}")
    for index, row in df.iterrows():
        url = row['Hyperlinks']
        try:
            response = requests.get(url, timeout=30)  # timeout so an unresponsive host cannot stall the loop
            # Check if the request was successful
            if response.status_code == 200:
                article_contents.append(response.text)
                successful_rows.append(row)
                print(f"Successfully retrieved content for URL: {url}")
            else:
                print(f"Failed to retrieve {url} with status code {response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"An error occurred while fetching {url} from sheet {sheet_name}: {e}")

# Create a new DataFrame with only the successful rows
successful_df = pd.DataFrame(successful_rows)

# Now `article_contents` contains the content of each article across all sheets
print(f"Retrieved content for {len(article_contents)} articles across all sheets.")

# Print the successful DataFrame
print("DataFrame with successful rows:")
print(successful_df)

Reading sheet: August23
Successfully retrieved content for URL: https://www.livemint.com/opinion/online-views/india-can-usher-in-the-beginning-of-the-end-of-an-ice-age-globally-11692288308958.html
Successfully retrieved content for URL: https://www.deccanherald.com/business/companies/volvo-india-to-bid-in-govts-e-bus-programme-says-company-chief-2651958
Failed to retrieve https://www.financialexpress.com/industry/optimizing-warehouse-space-how-asrs-maximizes-storage-capacity-at-india-warehousing-intra-logistics-companies/3215425/ with status code 403
Successfully retrieved content for URL: https://m.timesofindia.com/auto/news/bharat-ncap-to-be-launched-by-nitin-gadkari-next-week-indias-own-crash-test-rating/articleshow/102876658.cms
Successfully retrieved content for URL: https://m.timesofindia.com/city/vijayawada/andhra-pradeshs-avera-making-waves-on-global-e-scooters-market/articleshow/102861628.cms
Successfully retrieved content for URL: https://www.livemint.com/opinion/online-views

Successfully retrieved content for URL: https://www.republicworld.com/business-news/india-business/hotel-industrys-gdp-contribution-to-reach-rs-83-dot-15-lakh-crore-by-2047-hai-report-articleshow.html
Failed to retrieve https://www.hindustantimes.com/lifestyle/travel/tourism-ministry-launches-campaign-to-showcase-india-as-premier-wedding-destination-101692509493540.html with status code 401
Successfully retrieved content for URL: https://hospitality.economictimes.indiatimes.com/news/travel/india-stands-poised-to-be-tourism-powerhouse-up-minister-of-state/102854911
Successfully retrieved content for URL: https://www.thehindu.com/news/national/kerala/ayurveda-wellness-tourism-industry-hit-by-erratic-monsoon/article67216856.ece
Failed to retrieve https://www.financialexpress.com/industry/fe-ecube-study-how-some-trailblazing-companies-are-unleashing-a-green-tidal-wave/3211468/ with status code 403
Successfully retrieved content for URL: https://oilprice.com/Energy/Energy-General/Saudi-Arab

Failed to retrieve https://www.financialexpress.com/industry/adani-green-energy-to-raise-5-billion-through-global-bonds/3212926/ with status code 403
Successfully retrieved content for URL: https://www.autocarpro.in/news/bharat-charge-alliance-and-chademo-association-partner-to-promote-interoperable-charging-infrastructure-in-india-116391
Failed to retrieve https://www.marineinsight.com/shipping-news/india-to-build-5-fleet-support-ships-worth-20000-crore-a-major-boost-to-its-naval-capacity/ with status code 404
Failed to retrieve https://www.financialexpress.com/business/infrastructure-massive-cost-overruns-plague-388-infrastructure-projects-india-faces-rs-4-65-lakh-crore-setback-in-july-2023-3216133/ with status code 403
Successfully retrieved content for URL: https://www.livemint.com/market/stock-market-news/sebi-plans-to-bring-follow-on-offer-rules-for-reits-invits-11692519612262.html
Successfully retrieved content for URL: https://m.economictimes.com/industry/telecom/telecom-news/t

Successfully retrieved content for URL: https://www.thehindu.com/business/vizag-steel-plant-logs-highest-ever-monthly-sales-of-value-added-steel-in-august/article67263863.ece
Failed to retrieve https://www.business-standard.com/industry/news/telecom-bill-likely-to-require-verification-of-users-on-otts-apps-123090300716_1.html with status code 403
Successfully retrieved content for URL: https://www.bqprime.com/business/democratising-infrastructure-investment-in-india-the-alternative-bet
Successfully retrieved content for URL: https://www.cnbctv18.com/economy/lpg-price-cut-cabinet-decides-to-make-domestic-cylinders-cheaper-by-rs-200-17666451.htm
Successfully retrieved content for URL: https://www.livemint.com/news/india/us-and-india-join-forces-to-launch-renewable-energy-technology-action-platform-11693408975833.html
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/renewable/explainer-indias-green-hydrogen-standard/103114649
Successfully retrieved 

Failed to retrieve https://www.business-standard.com/economy/news/jnpa-eyes-top-spot-in-container-volumes-amid-gati-shakti-boosted-revamp-123083100017_1.html with status code 403
Successfully retrieved content for URL: https://www.constructionworld.in/transport-infrastructure/ports-and-shipping/revamp-empowers-older-private-cargo-terminals-at-major-ports-/43694
Successfully retrieved content for URL: https://www.constructionworld.in/transport-infrastructure/ports-and-shipping/kolkata-port-offers-60-acre-for-state-s-first-multi-modal-logistics-park/43798
Successfully retrieved content for URL: https://travel.economictimes.indiatimes.com/news/aviation/domestic/india-new-zealand-sign-mou-for-civil-aviation-cooperation/103190022
Successfully retrieved content for URL: https://www.saurenergy.com/solar-energy-news/indian-airports-move-towards-renewables-glare-hazards-pose-new-challenge
Failed to retrieve https://www.financialexpress.com/healthcare/news-healthcare/government-has-taken-various

Successfully retrieved content for URL: https://www.thehindu.com/news/international/sri-lanka-ends-fuel-rationing-imposed-at-height-of-economic-crisis/article67260860.ece
Failed to retrieve https://www.business-standard.com/economy/news/world-s-first-flex-fuel-car-will-launch-in-india-today-why-this-matters-123082900260_1.html with status code 403
Successfully retrieved content for URL: https://www.millenniumpost.in/nation/up-gears-up-to-showcase-food-basket-of-india-at-international-trade-show-531249
Failed to retrieve https://www.financialexpress.com/business/industry/morgan-stanley-arm-partners-with-mumbai-realtor-for-warehousing-project/3225523/ with status code 403
Failed to retrieve https://www.business-standard.com/companies/news/jio-submits-second-legal-opinion-on-satellite-spectrum-backing-auction-123090300459_1.html with status code 403
Successfully retrieved content for URL: https://m.timesofindia.com/city/ahmedabad/preference-to-new-tech-guj-loses-42l-telecom-subscribers/ar

Successfully retrieved content for URL: https://www.thehindu.com/news/national/russia-india-energy-ties-to-increase-this-year-says-trade-commissioner/article67252384.ece
Failed to retrieve https://www.business-standard.com/companies/news/shipping-logistics-firm-ups-opens-india-s-first-tech-centre-in-chennai-123082800689_1.html with status code 403
Failed to retrieve https://www.financialexpress.com/business/express-mobility-optimsing-resources-and-transport-infrastructure-a-path-to-reducing-national-logistic-costs-and-curtailing-dry-trucking-losses-in-india-3231495/ with status code 403
Failed to retrieve https://www.business-standard.com/india-news/slow-progress-in-highway-construction-persists-despite-government-push-123090100914_1.html with status code 403
Failed to retrieve https://www.business-standard.com/industry/news/competition-heats-up-for-iaf-s-mta-deal-embraer-launches-its-bid-in-india-123082800869_1.html with status code 403
Successfully retrieved content for URL: https://

Successfully retrieved content for URL: https://swarajyamag.com/infrastructure/cil-should-focus-on-easy-diversification-opportunity-in-power-generation
Successfully retrieved content for URL: https://www.thehindu.com/news/cities/Visakhapatnam/parliamentary-committee-on-chemicals-fertilizers-to-visit-visakhapatnam-for-two-days-from-september-1/article67252240.ece
Successfully retrieved content for URL: https://psuwatch.com/newsupdates/patel-engineering-jv-bag-rs-3637-cr-contract-from-nhpc
Successfully retrieved content for URL: https://psuwatch.com/newsupdates/govt-cuts-windfall-tax-on-domestic-crude-hikes-levy-on-export-of-diesel-atf
Successfully retrieved content for URL: https://www.livemint.com/news/india/us-and-india-join-forces-to-launch-renewable-energy-technology-action-platform-11693408975833.html
Successfully retrieved content for URL: https://www.thehindubusinessline.com/economy/higher-diesel-exports-help-refined-products-shipment-grow-15-y-o-y-in-july/article67241058.ece
Suc

Successfully retrieved content for URL: https://www.thehindu.com/news/national/tamil-nadu/hyundai-donates-ambulances-to-government-hospital-and-phc-in-cuddalore-district/article67254946.ece
Successfully retrieved content for URL: https://www.autocarpro.in/news-international/hyundai-motor-group-partners-korea-zinc-on-nickel-value-chain-for-ev-business-116571
Successfully retrieved content for URL: https://www.autocarpro.in/news/hero-motocorp-set-to-have-portfolio-of-half-a-dozen-premium-bikes-116559
Successfully retrieved content for URL: https://www.livemint.com/companies/auto-firms-look-to-speed-up-as-festive-season-kicks-off-11693420559226.html
Successfully retrieved content for URL: https://www.autocarindia.com/bike-news/new-ola-s1-range-receives-over-75000-bookings-429180
Successfully retrieved content for URL: https://auto.economictimes.indiatimes.com/news/industry/how-tvs-is-leveraging-tech-for-a-premium-play-with-x-its-latest-offering/103206988
Successfully retrieved content for

Failed to retrieve https://www.business-standard.com/markets/news/bpcl-ioc-hpcl-stocks-slide-on-reports-omcs-may-bear-lpg-subsidy-cost-123083000502_1.html with status code 403
Successfully retrieved content for URL: https://www.deccanherald.com/business/rs-200-lpg-price-cut-on-oil-companies-govt-unlikely-to-give-subsidy-2666455
Successfully retrieved content for URL: https://www.navhindtimes.in/2023/08/31/business/lpg-price-cut-govt-unlikely-to-give-subsidy-to-oil-companies/
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/oil-and-gas/rs-200-lpg-price-cut-on-oil-companies-government-unlikely-to-give-subsidy/103226836
Successfully retrieved content for URL: https://www.thehindu.com/news/national/one-in-four-ujjwala-yojana-beneficiaries-took-zero-or-one-lpg-cylinder-refills-last-year-despite-200-subsidy-rti-data-reveals/article67255260.ece
Successfully retrieved content for URL: https://www.newindianexpress.com/business/2023/aug/31/omcs-may-have-to

Successfully retrieved content for URL: https://indiashorts.com/carbon-capture-utilization-and-storage-market-estimated-to-reach-14-2-billion-by-2030-globally-at-a-cagr-of-21-5-says-marketsandmarkets/141656/
Successfully retrieved content for URL: https://indianexpress.com/article/research/the-dark-story-of-oil-the-lubricant-of-the-global-economy-8919029/
Successfully retrieved content for URL: https://www.zeebiz.com/markets/stocks/news-gr-infraprojects-share-price-nse-bse-latest-news-251164
Successfully retrieved content for URL: https://www.einnews.com/pr_news/653031551/delhi-to-host-the-biggest-conference-on-bitumen-petrochemicals-and-petro-products
Successfully retrieved content for URL: https://www.maritimejournal.com/vessels/special-feature-deep-dive-into-methanol-part-1/1487014.article
Successfully retrieved content for URL: https://auto.economictimes.indiatimes.com/news/oil-and-lubes/indonesias-pertamina-plans-more-biofuel-products-ethanol-imports-in-2024/103220325
Failed to re

Successfully retrieved content for URL: https://www.thehindu.com/business/sugar-industry-body-isma-seeks-5-gst-on-flex-fuel-vehicles-in-line-with-evs/article67245320.ece
Successfully retrieved content for URL: https://www.thehindu.com/news/national/tamil-nadu/tangedco-yet-to-implement-reduced-tariff-for-ev-charging-stations/article67256304.ece
Successfully retrieved content for URL: https://www.autocarpro.in/analysis-sales/electric-two-wheeler-sales-bounce-back-in-august-to-60000-units-116584
Successfully retrieved content for URL: https://indianexpress.com/article/business/market/auto-sales-at-all-time-monthly-high-but-2-wheeler-sales-crawl-8920418/
Successfully retrieved content for URL: https://www.autocarpro.in/news/ola-electric-on-course-for-2-lakh-units-sales-in-2023-to-grow-by-20-25-percent-116604
Failed to retrieve https://www.business-standard.com/companies/results/ev-maker-ather-energy-s-revenue-jumps-4-4-times-loss-2-5-times-in-fy23-123083000836_1.html with status code 403
F

Successfully retrieved content for URL: https://www.livemint.com/opinion/first-person/why-india-s-success-in-financial-dpi-should-be-replicated-in-climate-action-11693561841614.html
Successfully retrieved content for URL: https://www.ft.com/content/6c42055f-5d7c-4af2-bc5c-c6e559f26986
Successfully retrieved content for URL: https://www.livemint.com/opinion/first-person/why-india-s-success-in-financial-dpi-should-be-replicated-in-climate-action-11693561841614.html
Successfully retrieved content for URL: https://www.thehindu.com/news/national/jan-dhan-yojana-revolutionised-financial-inclusion-in-india-more-than-50-cr-bank-acs-opened-fm/article67243739.ece
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/renewable/three-challenges-india-must-overcome-for-a-successful-energy-transition/103231679
Failed to retrieve https://www.financialexpress.com/opinion/the-highway-out-of-poverty-technology-enabling-equitable-access-to-social-protection-systems/3228

Successfully retrieved content for URL: https://www.thehindu.com/news/cities/Kochi/scientists-unravel-genome-secrets-of-indian-oil-sardine/article67280727.ece
Successfully retrieved content for URL: https://infra.economictimes.indiatimes.com/news/ports-shipping/at-pm-modis-behest-global-maritime-summit-being-shifted-to-mumbai/103518694
Successfully retrieved content for URL: https://www.hydrogenfuelnews.com/hydrogen-fuel-india-tata/8560540/
Failed to retrieve https://www.business-standard.com/industry/news/industry-seeks-govt-intervention-on-data-requirements-for-eu-s-carbon-tax-123090700764_1.html with status code 403
Successfully retrieved content for URL: https://oilprice.com/Energy/Crude-Oil/The-Tipping-Point-In-Global-Oil-Demand.html
Successfully retrieved content for URL: https://www.cnbc.com/2023/09/06/india-importing-russian-oil-is-win-win-for-global-economy-says-ongc.html
Successfully retrieved content for URL: https://www.hydrocarbonengineering.com/refining/07092023/aveva-rec

Successfully retrieved content for URL: https://www.thehindubusinessline.com/news/national/south-korean-industry-delegation-to-visit-india-to-enhance-defence-ties/article67266791.ece
An error occurred while fetching https://simpleflying.com/iata-optimistic-walsh-warns-capacity-cuts-destroy-jobs/ from sheet September23: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Successfully retrieved content for URL: https://www.thehindu.com/news/national/tamil-nadu/tamil-nadu-government-to-soon-come-up-with-action-plan-for-development-of-tech-cities-minister/article67280137.ece
Successfully retrieved content for URL: https://www.livemint.com/industry/energy/g20-summit-in-delhi-global-leaders-to-discuss-green-hydrogen-manufacturing-and-supply-11693964667017.html
Successfully retrieved content for URL: https://timesofindia.indiatimes.com/travel/travel-news/finland-becomes-the-first-european-nation-to-test-the-digital-passports/articleshow/103438530.cms
S

Successfully retrieved content for URL: https://www.psuconnect.in/news/coal-india-organises-vendor-development-program-for-women-entrepreneurs/39167/
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/coal/india-asks-utilities-to-import-4-per-cent-coal-until-march-2024/103337074
Successfully retrieved content for URL: https://m.timesofindia.com/india/tripling-re-capacity-ok-but-about-phasing-out-coal/articleshow/103553360.cms
Successfully retrieved content for URL: https://www.indiatoday.in/environment/story/g20s-per-capita-carbon-dioxide-emissions-from-coal-rise-7-research-2431195-2023-09-05
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/coal/india-sees-alarming-29-rise-in-per-capita-coal-emissions-amidst-global-transition-to-clean-energy-report/103416838
Successfully retrieved content for URL: https://www.livemint.com/market/stock-market-news/coal-india-share-price-rises-5-heres-why-experts-see-a-further-

Successfully retrieved content for URL: https://www.thehindu.com/news/cities/chennai/plans-afoot-to-turn-thoothukudis-voc-port-as-green-hydrogen-hub-and-trans-shipment-hub/article67271000.ece
Failed to retrieve https://www.aninews.in/news/national/general-news/up-govt-to-accelerate-remaining-projects-under-pm-kusum-yojana20230905195236 with status code 403
Successfully retrieved content for URL: https://www.thehindubusinessline.com/data-stories/data-focus/data-focus-kusum-should-bloom-to-make-farmers-smile/article67280349.ece
Successfully retrieved content for URL: https://indianexpress.com/article/business/commodities/lpg-price-cut-ujjwala-expansion-could-cost-over-rs-37k-cr-annually-8921631/
Successfully retrieved content for URL: https://www.deccanherald.com/business/lpg-price-cut-ujjwala-expansion-could-cost-rs-37000-crore-annually-who-bears-the-load-2671161
Successfully retrieved content for URL: https://www.thehindu.com/news/cities/Delhi/city-is-in-bad-shape-have-to-clean-it-delh

Successfully retrieved content for URL: https://www.moneycontrol.com/news/business/indias-gail-expects-to-source-20-to-25-of-lng-on-short-term-or-spot-basis-11326641.html
Failed to retrieve https://www.reuters.com/business/energy/gastech-indias-gail-expects-source-20-25-lng-short-term-or-spot-basis-2023-09-07/ with status code 401
Successfully retrieved content for URL: https://www.livemint.com/companies/news/gail-india-expects-to-source-20-25-of-lng-on-short-term-or-spot-basis-report-11694063569556.html
Failed to retrieve https://www.reuters.com/sustainability/oil-india-plans-net-zero-by-2040-invest-2-bln-projects-sources-2023-09-08/ with status code 401
Successfully retrieved content for URL: https://www.zeebiz.com/markets/stocks/news-ongc-share-price-bse-news-opal-invest-rs-15000-crore-in-ongc-petro-additions-gail-252605
Successfully retrieved content for URL: https://www.zeebiz.com/companies/news-ongc-to-infuse-rs-15000-crore-in-opal-edge-out-gail-to-take-control-of-petchem-firm-25

Successfully retrieved content for URL: https://auto.hindustantimes.com/auto/electric-vehicles/volkswagen-in-advanced-talks-with-mahindra-to-share-meb-platform-and-components-for-budget-ev-41693816778344.html
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/oil-and-gas/diesel-petrol-consumption-rises-year-on-year-on-increased-mobility/103471158
Successfully retrieved content for URL: https://www.autocarpro.in/news-international/volvo-cars-opens-new-tech-hub-and-global-innovation-centre-in-singapore--116644
Successfully retrieved content for URL: https://m.timesofindia.com/business/india-business/diesel-woes-boon-for-hyundais-suv-sales/articleshow/103410092.cms
Failed to retrieve https://www.business-standard.com/industry/auto/mahindra-to-use-volkswagen-s-meb-components-for-ev-production-in-india-123090400610_1.html with status code 403
Successfully retrieved content for URL: https://www.cnbctv18.com/auto/mahindra-to-use-volkswagens-meb-electric-c

Failed to retrieve https://www.arabnews.com/node/2368286/business-economy with status code 403
Successfully retrieved content for URL: https://www.indiainfoline.com/article/news-top-story/ntpc-oil-india-sign-agreement-to-explore-collaborations-in-renewable-energy-sector-1693794105849_1.html
Failed to retrieve https://www.financialexpress.com/business/industry-group-must-unite-for-swift-action-on-green-transition-industry-3237571/ with status code 403
Failed to retrieve https://www.pv-magazine.com/2023/09/06/the-case-for-hard-carbon-based-sodium-ion-batteries/ with status code 403
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/renewable/g20-summit-2023-expectations-of-indias-renewable-energy-industry/103499136
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/oil-and-gas/crude-shipments-to-india-hit-six-month-low-as-russian-oil-exports-decline-report/103464027
Successfully retrieved content for URL: https:/

Successfully retrieved content for URL: https://www.autocarpro.in/news-international/aprilia-rs-457-unveiled-india-launch-expected-soon-116674
Successfully retrieved content for URL: https://www.thehindubusinessline.com/companies/nlc-proposes-a-capex-of-82174-crore-for-mining-and-power-generation-capacity-expansion-by-2030/article67270071.ece
Successfully retrieved content for URL: https://indianexpress.com/article/business/market/poor-monsoon-to-drive-tractor-sales-in-slow-lane-this-year-8924657/
Successfully retrieved content for URL: https://www.autocarpro.in/news/swaraj-tractors-introduces-new-40-50hp-range-at-rs-69-lakh-116623
Successfully retrieved content for URL: https://m.timesofindia.com/city/ahmedabad/states-ev-dreams-major-automakers-on-the-radar/articleshow/103545213.cms
Successfully retrieved content for URL: https://www.autocarpro.in/news-national/vedanta-aluminium-helps-ev-makers-with-lightweight-solutions-116681
Successfully retrieved content for URL: https://m.timesof

Successfully retrieved content for URL: https://www.thehindubusinessline.com/opinion/problem-posed-by-converging-nominal-real-gdp/article67274563.ece
Successfully retrieved content for URL: https://www.thehindu.com/opinion/op-ed/the-importance-of-states-in-space-missions/article67241335.ece
Successfully retrieved content for URL: https://www.hellenicshippingnews.com/india-gdp-growth-accelerates-in-2q23/
Successfully retrieved content for URL: https://www.livemint.com/economy/charting-a-realistic-path-of-india-s-growth-11693913790417.html
Successfully retrieved content for URL: https://www.thehindu.com/data/the-week-in-5-charts-indias-driest-august-gdp-first-quarter-growth-and-more/article67259173.ece
Successfully retrieved content for URL: https://www.zeebiz.com/economy-infra/news-morgan-stanley-raises-india-gdp-forecast-after-q1-data-surprises-positively-251872
Successfully retrieved content for URL: https://www.moneycontrol.com/news/opinion/the-curious-case-of-exports-imports-in-june

Successfully retrieved content for URL: https://indianexpress.com/article/cities/delhi/dwarka-expressway-tunnel-work-nhai-eight-lanes-expressway-8943148/
Successfully retrieved content for URL: https://m.timesofindia.com/city/lucknow/up-govt-to-focus-on-north-south-link-to-boost-road-connectivity/articleshow/103725912.cms
Successfully retrieved content for URL: https://infra.economictimes.indiatimes.com/news/roads-highways/collect-only-50-toll-fee-at-vagaikulam-hc-to-nhai/103600884
Successfully retrieved content for URL: https://sundayguardianlive.com/business/national-logistics-policy-accelerates-india-towards-economic-efficiency
Successfully retrieved content for URL: https://m.timesofindia.com/city/chennai/veera-to-be-deployed-on-national-highways-in-future/articleshow/103621439.cms
Successfully retrieved content for URL: https://infra.economictimes.indiatimes.com/news/roads-highways/govt-asks-natl-highways-to-improve-road-infra/103626775
Successfully retrieved content for URL: http

Failed to retrieve https://www.financialexpress.com/business/industry-net-zero-target-oil-india-to-invest-rs-25000-crore-by-2040-3243993/ with status code 403
Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/oil-and-gas/haryana-govt-to-set-up-retail-fuel-outlets-in-11-jail-complexes/103650547
Successfully retrieved content for URL: https://health.economictimes.indiatimes.com/news/pharma/attracting-young-talent-to-the-pharma-industry-in-india-a-pathway-to-innovation-and-growth/103710062
Successfully retrieved content for URL: https://health.economictimes.indiatimes.com/news/industry/digital-supply-chain-an-imperative-for-success-in-the-pharma-industry/103675792
Successfully retrieved content for URL: https://www.thehindubusinessline.com/specials/pulse/seeking-a-high-pitched-shout-out-for-public-health-at-unga/article67318281.ece
Failed to retrieve https://www.business-standard.com/industry/news/cabinet-approves-foreign-investment-of-up-to-rs-9-589

Successfully retrieved content for URL: https://www.livemint.com/companies/news/grasim-to-launch-paints-biz-by-early-next-yr-11694714712924.html
Successfully retrieved content for URL: https://m.economictimes.com/industry/cons-products/paints/aditya-birla-group-to-launch-its-paints-business-birla-opus-in-q4/articleshow/103653402.cms
Failed to retrieve https://www.businessworld.in/article/Paint-Industry-Shades-Of-Success/16-09-2023-491557 with status code 202
Successfully retrieved content for URL: https://www.livemint.com/companies/news/general-mills-india-to-invest-rs-100-crore-in-new-manufacturing-facility-for-pillsbury-baking-mixes-in-new-delhi-11694681856057.html
Successfully retrieved content for URL: https://www.newindianexpress.com/business/2023/sep/14/indias-fast-food-chainslaid-low-by-inflation-chicken-items-still-in-vogue-2614848.html
Successfully retrieved content for URL: https://health.economictimes.indiatimes.com/news/industry/indias-exposure-to-lead-cost-the-country-9-of

Failed to retrieve https://www.business-standard.com/economy/news/pm-modi-to-lay-foundation-stone-for-bina-refinery-expansion-on-14-sept-123091201154_1.html with status code 403
Successfully retrieved content for URL: https://www.newsip.in/indias-petrochemical-revolution-pm-modi-unveils-bpcls-rs-49000-crore-mega-project/
Failed to retrieve https://www.financialexpress.com/market/commodities-india-may-end-use-of-fossil-fuel-much-before-2070-oil-minister-3241445/ with status code 403
Successfully retrieved content for URL: https://www.livemint.com/companies/iocl-bpcl-hpcls-gross-refining-margins-to-remain-in-the-range-of-9-to-10-bbl-careedge-11694497680983.html
Successfully retrieved content for URL: https://swarajyamag.com/infrastructure/explained-why-india-and-saudi-arabia-are-expediting-the-50-billion-west-coast-refinery-project
Successfully retrieved content for URL: https://www.thehindubusinessline.com/data-stories/deep-dive/csr-spends-of-india-inc-increase-during-pandemic-led-by-ri

Successfully retrieved content for URL: https://m.timesofindia.com/auto/news/in-fast-lane-car-companies-eye-4-million-sales-in-2023/articleshow/103616795.cms
Successfully retrieved content for URL: https://www.thehindu.com/business/vst-group-plans-expansion-across-verticals-in-india-abroad/article67315267.ece
Successfully retrieved content for URL: https://www.livemint.com/brand-stories/gs-caltex-launches-kixx-smart-bike-stations-in-hyderabad-11686302460073.html
Successfully retrieved content for URL: https://www.autocarpro.in/news/jeep-launches-compass-2wd-diesel-automatic-at-rs-2399-lakh-116793
Successfully retrieved content for URL: https://www.thehindu.com/news/first-commercial-production-of-acc-batteries-under-pli-to-start-from-decemberjanuary-mahendra-nath-pandey/article67299052.ece
Failed to retrieve https://www.mobilityoutlook.com/news/greencell-mobility-signs-mou-with-volvo-eicher-to-source-1000-e-buses/ with status code 403
Failed to retrieve https://www.mobilityoutlook.com/f

Successfully retrieved content for URL: https://energy.economictimes.indiatimes.com/news/renewable/industry-body-submits-recommendations-to-govt-to-fast-track-green-hydrogen-mission/103665424
Failed to retrieve https://www.aninews.in/news/national/general-news/phdcci-presents-10-recommendations-to-accelerate-indias-national-green-hydrogen-mission20230914175018 with status code 403
Successfully retrieved content for URL: https://www.livemint.com/industry/green-hydrogen-for-steelmaking-in-india-will-only-catch-up-by-2050-says-report-11694695010439.html
Successfully retrieved content for URL: https://www.hellenicshippingnews.com/abs-issues-aip-for-hanwha-oceans-industry-first-zero-carbon-gas-carrier-16/
Successfully retrieved content for URL: https://www.benzinga.com/pressreleases/23/09/34555665/fuel-dispensing-equipment-market-growth-report-2023-2030-115-pages-report
Failed to retrieve https://www.reuters.com/business/energy/gazprom-delivers-its-first-lng-cargo-china-via-arctic-2023-09-1

KeyboardInterrupt: 
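
The log above shows many 401/403 responses and a run that had to be interrupted by hand. A minimal sketch of a more defensive fetch step follows, assuming a shared `requests.Session`, an explicit timeout, and a browser-like `User-Agent`. The names `fetch_html` and `make_session`, the header value, and the 10-second timeout are all assumptions for illustration; sites that block scrapers may still refuse the request.

```python
import requests

def make_session():
    """Create a session that reuses connections and sends a browser-like header."""
    session = requests.Session()
    # Assumed header value; it does not guarantee access to sites that return 403.
    session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; news-link-checker)"})
    return session

def fetch_html(url, session, timeout=10):
    """Return (status_code, text). status_code is None if the request itself failed."""
    try:
        response = session.get(url, timeout=timeout)
    except requests.exceptions.RequestException:
        # Covers connection errors, timeouts, invalid URLs, etc.
        return None, None
    if response.status_code == 200:
        return 200, response.text
    # Non-200 responses are reported without keeping their body
    return response.status_code, None
```

Returning the status code alongside the text makes it easy to tally failures per site afterwards instead of reading them off the printed log.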

## Reading file

In [1195]:
import requests
import pandas as pd

# File path
excel_file = "BPCLIPEX News Links (1).xlsx"

# Read all sheets into a dictionary of DataFrames
dfs = pd.read_excel(excel_file, sheet_name=None)

In [1196]:
# Storing each dataframe in a list
dfs_list = []

for sheet_name, df in dfs.items():
    df['Sheet'] = sheet_name
    dfs_list.append(df)

df = pd.concat(dfs_list, ignore_index=True)
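
As an aside, the loop above can likely be collapsed into a single `pd.concat` call, since `pd.concat` accepts a dictionary and uses its keys as an outer index level. The sketch below uses two toy frames in place of the real sheets; `sheets` and `combined` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    "August23": pd.DataFrame({"Links": ["a", "b"]}),
    "September23": pd.DataFrame({"Links": ["c"]}),
}

combined = (
    pd.concat(sheets, names=["Sheet"])   # dict keys become an index level named 'Sheet'
      .reset_index(level="Sheet")        # move that level into an ordinary column
      .reset_index(drop=True)            # renumber rows 0..n-1
)
print(combined)
```

The one visible difference is that the `Sheet` column ends up first rather than last.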

In [1197]:
df

Unnamed: 0,Source,Links,Sector,Shortlisted Yes / No,Remark,Text,Sheet,Shortlisted Yes/No,Copied Text,Shortlist,@
0,Livemint,India can usher in the beginning of the end of...,TRANSPORT,,,,August23,,,,
1,Deccan Herald,"Volvo India to bid in govt’s e-bus programme, ...",TRANSPORT,,,,August23,,,,
2,Financial Express,Optimizing warehouse space: How ASRS maximizes...,LOGISTICS,,,,August23,,,,
3,Times Of India,Bharat NCAP to be launched by Nitin Gadkari ne...,TRANSPORT,,,,August23,,,,
4,Times Of India,Andhra Pradesh's Avera making waves on global ...,EV,,,,August23,,,,
...,...,...,...,...,...,...,...,...,...,...,...
12048,Autocar Professional,M&M projects 5% growth in domestic tractor ind...,,,,,May2024,,,No,
12049,Economic Times,"Escorts Kubota plans to invest up to Rs 4,500 ...",,,,,May2024,,,No,
12050,Autocar Professional,Domestic tractor industry may see lower mid-si...,,,,,May2024,,,No,
12051,Economic Times,Expecting good response from many companies on...,,,,,May2024,,,No,


# Text preprocessing 

## Dropping and merging duplicate columns

In [1198]:
# Checking for all the sheets
df.Sheet.unique()

array(['August23', 'September23', 'October 23', 'November 23',
       'December 23', 'Dec23 Shortlist', 'January24', 'February2024',
       'March2024', 'April2024', 'May2024'], dtype=object)

In [1199]:
# Column names
df.columns

Index(['Source', 'Links', 'Sector', 'Shortlisted Yes / No', 'Remark', 'Text',
       'Sheet', 'Shortlisted Yes/No', 'Copied Text', 'Shortlist ', '@'],
      dtype='object')

In [1200]:
# Dropping irrelevant columns
df=df.drop(['Source','Remark','Text','Copied Text','@'], axis=1)

In [1201]:
df.columns

Index(['Links', 'Sector', 'Shortlisted Yes / No', 'Sheet',
       'Shortlisted Yes/No', 'Shortlist '],
      dtype='object')

In [1202]:
# Copy the first shortlist column that holds any non-null value into a new 'Shortlisted' column;
# if all three columns are empty, store None.

if df['Shortlisted Yes / No'].notna().any():
    df['Shortlisted']=df['Shortlisted Yes / No']
elif df['Shortlisted Yes/No'].notna().any():
    df['Shortlisted']=df['Shortlisted Yes/No']
elif df['Shortlist '].notna().any():
    df['Shortlisted']=df['Shortlist ']
else:
    df['Shortlisted']=None
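
The cell above picks one source column for the whole DataFrame. If the three columns are filled in on different sheets, a row-wise coalesce may be closer to the intent. The sketch below uses `Series.combine_first` on a small hypothetical frame (`demo`). It is an alternative under that assumption, and the outputs recorded in this notebook come from the original cell, not from this version.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: each one has its label in a different column
demo = pd.DataFrame({
    "Shortlisted Yes / No": ["Yes", np.nan, np.nan, np.nan],
    "Shortlisted Yes/No": [np.nan, "No", np.nan, np.nan],
    "Shortlist ": [np.nan, np.nan, "Yes ", np.nan],
})

# For every row, take the first non-null value across the three columns
demo["Shortlisted"] = (
    demo["Shortlisted Yes / No"]
    .combine_first(demo["Shortlisted Yes/No"])
    .combine_first(demo["Shortlist "])
)
print(demo["Shortlisted"].tolist())
```

Rows with no label in any column stay NaN and can be mapped to 0 in the next step.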

In [1203]:
df.Shortlisted.unique()

array([nan, 'No', 'Yes', 'Yes ', 'Np', 'NO', ' No'], dtype=object)

In [1204]:
df['Shortlisted']=df['Shortlisted'].str.replace(' ','')                                     # Removing spaces
df['Shortlisted']=df['Shortlisted'].replace(to_replace='Yes',value='1')                     # Replacing Yes with 1
df['Shortlisted']=df['Shortlisted'].replace(to_replace=[None, 'No', 'Np', 'NO'],value='0')  # Replacing missing values and 'No' variants with 0
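
Listing every misspelling of "No" works for the values seen so far, but a new variant would slip through. A possible alternative, sketched here on a hypothetical Series `raw` holding the values printed above, is to normalise case and whitespace and treat only "YES" as positive. Note that this yields integers, whereas the notebook stores the strings '0' and '1'.

```python
import pandas as pd

# Hypothetical sample mirroring the unique values shown above
raw = pd.Series([None, "No", "Yes", "Yes ", "Np", "NO", " No"])

# Trim whitespace and upper-case, then flag exact "YES" as 1; everything
# else, including missing values and typos, becomes 0.
cleaned = raw.str.strip().str.upper()
flags = cleaned.eq("YES").astype(int)
print(flags.tolist())
```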

In [1205]:
df.Shortlisted.unique()

array(['0', '1'], dtype=object)

In [1206]:
df

Unnamed: 0,Links,Sector,Shortlisted Yes / No,Sheet,Shortlisted Yes/No,Shortlist,Shortlisted
0,India can usher in the beginning of the end of...,TRANSPORT,,August23,,,0
1,"Volvo India to bid in govt’s e-bus programme, ...",TRANSPORT,,August23,,,0
2,Optimizing warehouse space: How ASRS maximizes...,LOGISTICS,,August23,,,0
3,Bharat NCAP to be launched by Nitin Gadkari ne...,TRANSPORT,,August23,,,0
4,Andhra Pradesh's Avera making waves on global ...,EV,,August23,,,0
...,...,...,...,...,...,...,...
12048,M&M projects 5% growth in domestic tractor ind...,,,May2024,,No,0
12049,"Escorts Kubota plans to invest up to Rs 4,500 ...",,,May2024,,No,0
12050,Domestic tractor industry may see lower mid-si...,,,May2024,,No,0
12051,Expecting good response from many companies on...,,,May2024,,No,0


In [1207]:
# Dropping duplicate columns
df=df.drop(['Shortlisted Yes / No', 'Shortlisted Yes/No', 'Shortlist '], axis=1)

In [1208]:
df.shape

(12053, 4)

In [1209]:
df.columns

Index(['Links', 'Sector', 'Sheet', 'Shortlisted'], dtype='object')

In [1210]:
df['Shortlisted'].value_counts()

Shortlisted
0    11653
1      400
Name: count, dtype: int64

## Dropping null rows

In [1211]:
df['Links'].isnull().sum()

42

In [1212]:
df[df['Links'].isnull()]

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1659,,,October 23,0
1677,,,October 23,0
1742,,,October 23,0
1779,,,October 23,0
1822,,,October 23,0
1841,,,October 23,0
9855,,,April2024,0
9872,,,April2024,0
9905,,,April2024,0
9928,,,April2024,0


In [1213]:
df=df.dropna(subset=['Links'])

In [1214]:
df['Links'].isnull().sum()

0

In [1215]:
df.shape

(12011, 4)

## Cleaning 'Sector' values

In [1216]:
df.Sector.unique()

array(['TRANSPORT', 'LOGISTICS', 'EV', 'INFRA', 'TRADE', 'ELECTRONICS',
       'ENERGY', 'AGRICULTURE', 'TOURISM', 'PHARMA', 'FOOD', 'HOTEL',
       'RETAL', 'HOSPITALITY', 'AVIATION', 'TRAVEL', 'RENEWABLE',
       'MINING', 'PETROLEUM', 'RENEWABLES', 'AIRPORT', 'COAL', 'BIOGAS',
       'STEEL', 'ADHESIVES', 'BIOFUEL', 'DETERGENT', 'PORT',
       'CONSTRUCTION', 'TELECOM', 'NHAI', 'HIGHWAY', 'ROAD', 'BUSINESS',
       'MARINE', 'POWER', 'INFRA ', 'Cooking', 'GREEN ENERGY', 'ECONOMY',
       'CRUDE', 'FMCG', 'FUEL', 'SUSTAINABILITY', 'PETCHEEM', 'OIL',
       'PETROCHEM', 'POLICY', 'TEECH', 'MARKET', 'AUTO', nan,
       'MANUFACTURING', 'PAINT', 'CORRIDOR', 'MOBILITY', 'SUGAR',
       'CARBON EMISSION', 'HYDROCARBON', 'FREIGHT CORRIDOR',
       'LOGISTICS CORRIDOR', 'DETERGENTS', 'OMC', 'RETAIL', 'SHIPPING',
       'BUZZ', 'LOGISTIC HUB', 'ADHESIVE', 'PAINTS', 'GDP', 'Natural Gas',
       'Reenwable Energy', 'Technology', 'Retail', 'Aviation',
       'Electric Mobility', 'Competitor', '

In [1217]:
len(df.Sector.unique())

179

In [1218]:
import re

In [1219]:
# Filling NaN values with empty string to ensure consistent regular expression application
df['Sector'] = df['Sector'].fillna('')

In [1220]:
# Removing closing parentheses from the sector labels
def parentheses(text):
    text=re.sub(r'\)','',text)
    return text

df['Sector']=df['Sector'].apply(parentheses)

In [1221]:
# Convert to uppercase
df['Sector']=df['Sector'].str.upper()

In [1222]:
len(df.Sector.unique())

131

In [1223]:
# Dropping rows whose Sector contains a period (to remove stray values)
df=df[~df['Sector'].str.contains(r'\.', regex=True)]
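
The filter above matches a literal period. A small sketch of the same idea with `regex=False`, which avoids the escape entirely, is shown below. The Series `sectors` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical labels, two of which contain a period
sectors = pd.Series(["AUTO", "0.5", "INFRA", "N.A."], name="Sector")

# With regex=False the pattern is treated as plain text, so "." is literal
has_period = sectors.str.contains(".", regex=False)
kept = sectors[~has_period]
print(kept.tolist())
```

Inspecting `sectors[has_period]` before dropping is a cheap way to confirm that only stray values are removed.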

In [1224]:
len(df.Sector.unique())

128

### AUTO

In [1225]:
df[df['Sector'] == 'AITO']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
11825,Flipkart and Bajaj Auto join forces,AITO,May2024,0


In [1226]:
df[df['Sector'] == 'AOTO']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
9439,"JSW MG Motor India to invest Rs 5,000 cr; focu...",AOTO,March2024,0
9440,JSW MG Motor India plans to sell a million ele...,AOTO,March2024,0


In [1227]:
df[df['Sector'] == 'AUTO']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
234,Auto sales August 2023: Expect decent retail s...,AUTO,September23,0
1501,DPIIT and Gati Shakti Vishwavidyalaya forge al...,AUTO,October 23,0
1502,The transformative potential of technology in ...,AUTO,October 23,0
1525,Robotics adoption is impacting the coatings ma...,AUTO,October 23,0
1526,GCC light vehicle aftermarket revenues to top ...,AUTO,October 23,0
2363,Road to virtual world: Accelerating digital tr...,AUTO,October 23,0
2454,"Driven by EV and clean energy, auto sector dea...",AUTO,October 23,0
2703,"Despite slowdown in Shradh period, Navratri dr...",AUTO,November 23,0
3144,Schaeffler's India unit Q2 profit rises on rob...,AUTO,November 23,0
4114,Auto retail sales jump by historic 18% y-o-u i...,AUTO,December 23,0


In [1228]:
# AUTO
def auto_sector(text):
    text=re.sub(r'A.TO','AUTO',text)
    return text

df['Sector']=df['Sector'].apply(auto_sector)
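
The pattern `A.TO` is unanchored, so `re.sub` would also rewrite a matching fragment inside a longer label. The sketch below shows the difference on a hypothetical label, 'CANTON', that does not occur in the data shown here. `auto_sector_anchored` is an illustrative name, not part of the notebook.

```python
import re

def auto_sector_anchored(text):
    # ^ and $ restrict the substitution to labels that are exactly
    # four characters of the form A?TO (AITO, AOTO, AUTO, ...)
    return re.sub(r'^A.TO$', 'AUTO', text)

for label in ['AITO', 'AOTO', 'AUTO', 'CANTON']:
    unanchored = re.sub(r'A.TO', 'AUTO', label)
    print(label, '->', unanchored, '|', auto_sector_anchored(label))
```

On the labels listed in this section both versions give the same result. The anchored form only matters if a longer label happens to contain the pattern.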

In [1229]:
len(df.Sector.unique())

126

### LOGISTICS

In [1230]:
df[df['Sector'] == 'LOGISTICS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
2,Optimizing warehouse space: How ASRS maximizes...,LOGISTICS,August23,0
137,Centre asks states to formulate logistics policy,LOGISTICS,September23,0
281,Mahindra Logistics collaborates with Flipkart ...,LOGISTICS,September23,0
294,E-commerce set to drive country's logistics in...,LOGISTICS,September23,0
297,Centre asks states to formulate logistics policy,LOGISTICS,September23,0
...,...,...,...,...
11717,India sees its third unicorn of 2024 in logist...,LOGISTICS,May2024,0
11724,Logistics Boom: Demand for warehouse automatio...,LOGISTICS,May2024,0
11729,"Citroën signs MoU with OHM E Logistics more 1,...",LOGISTICS,May2024,0
11731,Safexpress Unveils Tamil Nadu’s Largest Logist...,LOGISTICS,May2024,0


In [1231]:
df[df['Sector'] == 'LOGIST']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
2306,CJ Darcl Logistics and Tata Motors partner to ...,LOGIST,October 23,0


In [1232]:
df[df['Sector'] == 'LOGISTIC PARK']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
4501,NHAI collaborates with private SPV to develop ...,LOGISTIC PARK,December 23,1


In [1233]:
df[df['Sector'] == 'LOGISTIC HUB']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1109,"India export duty stifles activity, prices eas...",LOGISTIC HUB,September23,0
3460,"IWLS to celebrate its 10th year in Mumbai, con...",LOGISTIC HUB,November 23,0
3865,Govt plan to shift wholesale market irks small...,LOGISTIC HUB,December 23,0
4194,DP World delves into Air Cargo with launch of ...,LOGISTIC HUB,December 23,0
4195,"COSCO, bp to bolster cooperation on hydrocarbo...",LOGISTIC HUB,December 23,0
4567,Govt proposes steep hike in circle rates,LOGISTIC HUB,December 23,0
4568,German-based aluplast GmbH looks to invest €4 ...,LOGISTIC HUB,December 23,0
6403,"TVS Industrial & Logistics Parks to invest ₹1,...",LOGISTIC HUB,January24,0
7169,The Renaissance of Manufacturing Supply Chains...,LOGISTIC HUB,February2024,0
7171,SoftBank-backed Meesho unveils Valmo to enable...,LOGISTIC HUB,February2024,0


In [1234]:
df[df['Sector'] == 'LOGISTICS CORRIDOR']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
697,"Global Biofuel Alliance, UK-Middle-East econom...",LOGISTICS CORRIDOR,September23,0


In [1235]:
df[df['Sector'] == 'LOGISTICS HUB']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1604,Essar's Black Box expands footprints in India ...,LOGISTICS HUB,October 23,0
5009,Nexzu Mobility signs MoU with Gujarat governme...,LOGISTICS HUB,December 23,0
5807,Extended producer responsibility in India and ...,LOGISTICS HUB,January24,0
6378,"Ivanhoe Cambridge, LOGOS to invest Rs 1,100 cr...",LOGISTICS HUB,January24,0
8543,Kerala offers investment subsidy for logistics...,LOGISTICS HUB,March2024,0
11818,Indian container cargo set to expand by 8% in ...,LOGISTICS HUB,May2024,0
11826,Indian carriers to expand global network from ...,LOGISTICS HUB,May2024,0
11827,Kerala Boosts Logistics Capability to Enhance ...,LOGISTICS HUB,May2024,0
11828,"Flipkart, Adani partner to set up data centre,...",LOGISTICS HUB,May2024,0


In [1236]:
df[df['Sector'] == 'LOGISTIC CORRIDOR']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
4192,India’s extreme rain was restricted to a ‘corr...,LOGISTIC CORRIDOR,December 23,0
4197,ADB injects $250 million to boost India's indu...,LOGISTIC CORRIDOR,December 23,0


In [1237]:
# LOGISTICS
def logistics_sector(text):
    text = re.sub(r'\s*LOGIST\w*\s*\w*','LOGISTICS', text)
    text = re.sub(r";\w+", "LOGISTICS", text)
    text = re.sub(r'LOGISTICS\s+','LOGISTICS', text)
    return text

df['Sector']=df['Sector'].apply(logistics_sector)

In [1238]:
len(df.Sector.unique())

117

### TRANSPORT

In [1239]:
df[df['Sector'] == 'TRANSPORT']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,India can usher in the beginning of the end of...,TRANSPORT,August23,0
1,"Volvo India to bid in govt’s e-bus programme, ...",TRANSPORT,August23,0
3,Bharat NCAP to be launched by Nitin Gadkari ne...,TRANSPORT,August23,0
284,Optimising resources and transport infrastruct...,TRANSPORT,September23,0
290,Addressing the challenges of delivering reliab...,TRANSPORT,September23,0
291,Sarbanada Sonowal : Inland waterways powering ...,TRANSPORT,September23,0
292,PM Gati Shakti fuels growth in India's inland ...,TRANSPORT,September23,0
296,Transport Corporation of India thrives as auto...,TRANSPORT,September23,0
598,India at G20 likely to push for green developm...,TRANSPORT,September23,0
1062,Optimising resources and transport infrastruct...,TRANSPORT,September23,0


In [1240]:
df[df['Sector'] == 'TRANSPORTATION']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
3003,Will India’s Domestic Industrial Policy Help I...,TRANSPORTATION,November 23,0
3005,Traditional coir industry sees decline in dist...,TRANSPORTATION,November 23,0
3006,India's metro rail network poised to surpass U...,TRANSPORTATION,November 23,0
3387,Govt Outsourced Critical Projects Worth Rs 500...,TRANSPORTATION,November 23,0
3513,McDermott awarded transportation and installat...,TRANSPORTATION,November 23,0
...,...,...,...,...
11707,NCLT allows IL&FS Transportation Networks to o...,TRANSPORTATION,May2024,0
11708,JSW One platform smashes $1 billion GMV milestone,TRANSPORTATION,May2024,0
11710,Transport Corp. of India Looking to Place Carg...,TRANSPORTATION,May2024,0
11711,IMEC: Indian team in UAE discusses start of wo...,TRANSPORTATION,May2024,0


In [1241]:
# TRANSPORT
def transport_sector(text):
    text=re.sub(r'TRANSPORT\w+','TRANSPORT',text)
    return text

df['Sector']=df['Sector'].apply(transport_sector)

In [1242]:
len(df.Sector.unique())

116

### INFRASTRUCTURE

In [1243]:
df[df['Sector'] == 'INFAR']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
7100,Zomato leases its largest warehousing space in...,INFAR,February2024,0


In [1244]:
df[df['Sector'] == 'INFRA ']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
133,"Adani To Invest Rs 2,000 Crore To Build 2 Tran...",INFRA,August23,0


In [1245]:
df[df['Sector'] == 'INFRA']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
5,India has taken action to de-risk infrastructu...,INFRA,August23,0
7,Warehousing and logistics sees HNI interest in...,INFRA,August23,0
15,Cabinet approves seven rail projects worth ₹32...,INFRA,August23,0
69,What India should do to bridge skilled manpowe...,INFRA,August23,0
95,Massive cost overruns plague 388 infrastructur...,INFRA,August23,0
...,...,...,...,...
11462,India's infrastructure drive boosts TMT bar de...,INFRA,May2024,0
11463,RailTel Sets Sights On Vietnam's Mega Infrastr...,INFRA,May2024,0
11815,"France, Germany to fund India's urban infra mi...",INFRA,May2024,0
11816,Adobe to offer India data centre infrastructur...,INFRA,May2024,0


In [1246]:
# INFRASTRUCTURE
def infrastructure_sector(text):
    text=re.sub(r'INF..','INFRASTRUCTURE',text)
    text = re.sub(r'INFRA\s+','INFRASTRUCTURE', text)
    return text

df['Sector']=df['Sector'].apply(infrastructure_sector)
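
One detail worth checking in the cell above: for the label 'INFRA ' (with a trailing space), `INF..` consumes only the first five characters, so the result appears to keep the trailing space, and the second substitution no longer finds 'INFRA' followed by whitespace. The sketch below reproduces this and shows a variant that strips whitespace first. `infrastructure_sector_stripped` is a hypothetical name, and the recorded counts in this notebook come from the original function.

```python
import re

def infrastructure_sector(text):
    # Original logic, copied from the cell above
    text = re.sub(r'INF..', 'INFRASTRUCTURE', text)
    text = re.sub(r'INFRA\s+', 'INFRASTRUCTURE', text)
    return text

def infrastructure_sector_stripped(text):
    # Strip surrounding whitespace, then map the known short forms
    text = text.strip()
    return re.sub(r'^(INFRA|INFAR)$', 'INFRASTRUCTURE', text)

for label in ['INFRA', 'INFAR', 'INFRA ']:
    print(repr(infrastructure_sector(label)),
          repr(infrastructure_sector_stripped(label)))
```

Applying `df['Sector'].str.strip()` once, before any of the per-sector functions, would have a similar effect for every label.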

In [1247]:
len(df.Sector.unique())

115

### TRADE

In [1248]:
df[df['Sector'] == 'TRADE']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
6,USTR to visit India next week for G20 trade & ...,TRADE,August23,0
50,Saudi Arabia's Crude Oil Exports Slump To 21-M...,TRADE,August23,0
121,"India, UAE make first-ever crude oil transacti...",TRADE,August23,0
123,India-Saudi ties a defining relationship of th...,TRADE,August23,0
130,Bengaluru is the United States’ most important...,TRADE,August23,0
185,Greece — India’s gateway to EU,TRADE,September23,0
222,India's Russian crude imports decline sharply ...,TRADE,September23,0
247,Ikea to start selling online in NCR by Decembe...,TRADE,September23,0
278,Asean-India alliance emerges as formidable for...,TRADE,September23,0
282,Russia-India energy ties to increase this year...,TRADE,September23,0


In [1249]:
df[df['Sector'] == 'TARDE']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
11061,China's share in India's industrial goods impo...,TARDE,May2024,0


In [1250]:
# TRADE
def trade_sector(text):
    text=re.sub(r'T..DE','TRADE',text)
    return text

df['Sector']=df['Sector'].apply(trade_sector)
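
Typos such as 'TARDE' have been found here by inspecting the unique values by hand. As a complement, the standard-library `difflib.get_close_matches` can suggest the nearest canonical label. The sketch below is an assumption-laden illustration: the `CANONICAL` list, the `suggest_sector` helper, and the 0.7 cutoff are all hypothetical choices, and suggestions should be reviewed before being applied.

```python
from difflib import get_close_matches

# Hypothetical list of accepted sector names
CANONICAL = ["AUTO", "TRADE", "TRAVEL", "MINING", "TELECOM", "RETAIL"]

def suggest_sector(label, choices=CANONICAL, cutoff=0.7):
    # Return the closest canonical label, or None if nothing is similar enough
    matches = get_close_matches(label, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for label in ["TARDE", "AITO", "MINIG", "TELCOM", "PORT"]:
    print(label, "->", suggest_sector(label))
```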

In [1251]:
len(df.Sector.unique())

114

### ELECTRONICS

In [1252]:
df[df['Sector'] == 'ELECTRONICS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
8,What India needs to become an electronics manu...,ELECTRONICS,August23,0
54,Urge India to reconsider PC import restriction...,ELECTRONICS,August23,0
55,Is India entering a semiconductor ‘red ocean’?...,ELECTRONICS,August23,0
56,Government targets 80% local value addition in...,ELECTRONICS,August23,0
57,"Mobile makers to hire 60,000 in 6 to 12 months...",ELECTRONICS,August23,0
...,...,...,...,...
11431,Tata Electronics begins export of semiconducto...,ELECTRONICS,May2024,0
11432,Tata Electronics ships ‘made in India’ chip sa...,ELECTRONICS,May2024,0
11822,Veira to invest Rs 450 cr on new unit to incre...,ELECTRONICS,May2024,0
11831,Indias electronic goods exports up 25.8% YoY i...,ELECTRONICS,May2024,0


In [1253]:
df[df['Sector'] == 'ELECTROLINICS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
8582,India's electronics industry meets targets; ra...,ELECTROLINICS,March2024,0
8583,Dholera chip fab: India's leap into electronic...,ELECTROLINICS,March2024,0
8584,"Mobile exports to grow to $50-60 bn, electroni...",ELECTROLINICS,March2024,0
8585,Joint ventures in focus: Indo-Dutch semiconduc...,ELECTROLINICS,March2024,0
8586,CG Power Ventures into OSAT Facility,ELECTROLINICS,March2024,0
8587,India will soon make equipment for semiconduct...,ELECTROLINICS,March2024,0
8588,Government to boost funds for India Semiconduc...,ELECTROLINICS,March2024,0
8589,"Foxconn, Samsung, 3 others to get Rs 4,400+ cr...",ELECTROLINICS,March2024,0
8594,India becomes second-largest manufacturer of m...,ELECTROLINICS,March2024,0


In [1254]:
df[df['Sector'] == 'ELECTRONNICS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
5496,Government measures to develop India as a glob...,ELECTRONNICS,January24,0
5497,"With more local value additions, electronics m...",ELECTRONNICS,January24,0
5498,"Premium products, investments to drive consume...",ELECTRONNICS,January24,0
5499,"Evaluating multiple EoIs for OSATs, fabs, mode...",ELECTRONNICS,January24,0
5500,Industry-academia connect make TN enviable for...,ELECTRONNICS,January24,0
5501,"India, Oman free trade agreement likely to be ...",ELECTRONNICS,January24,0
5502,"Foxconn officials meet CM, discuss more projec...",ELECTRONNICS,January24,0


In [1255]:
# ELECTRONICS
def electronics_sector(text):
    text=re.sub(r'ELECTRO\w+','ELECTRONICS',text)
    return text

df['Sector']=df['Sector'].apply(electronics_sector)

In [1256]:
len(df.Sector.unique())

112

### RENEWABLES

In [1257]:
df[df['Sector'] == 'REENWABLE ENERGY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1503,"KSA, India sign MoU to boost cooperation on re...",REENWABLE ENERGY,October 23,1


In [1258]:
df[df['Sector'] == 'RENEWABLE']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
49,FE-ECube Study: How some trailblazing companie...,RENEWABLE,August23,0
73,India government sets emission limit for hydro...,RENEWABLE,August23,0
135,Energy efficiency and renewables: Importance i...,RENEWABLE,August23,0
142,US and India join forces to launch renewable e...,RENEWABLE,September23,0
144,Offshore wind energy: India set to harness coa...,RENEWABLE,September23,0
198,India's renewable energy ambitions could excee...,RENEWABLE,September23,0
199,Tata Power Renewable in deal with steel maker ...,RENEWABLE,September23,0
204,India’s massive renewables deployment helps me...,RENEWABLE,September23,0
731,India sees alarming 29% rise in per capita coa...,RENEWABLE,September23,0
734,India among five major global economies in rac...,RENEWABLE,September23,0


In [1259]:
df[df['Sector'] == 'RENEWABLES']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
53,India govt sets emission limit for hydrogen to...,RENEWABLES,August23,0
89,Adani Green Energy targets 45 GW of renewable ...,RENEWABLES,August23,0
92,Adani Green Energy to raise $5 billion through...,RENEWABLES,August23,0
1065,India-Saudi Arabia power grid pact to accelera...,RENEWABLES,September23,0
2445,Govt plans to register only India-made solar p...,RENEWABLES,October 23,0
...,...,...,...,...
11808,India’s installed wind energy capacity to rise...,RENEWABLES,May2024,0
11809,Italy’s Enel Green Power acquires Indian renew...,RENEWABLES,May2024,0
11835,Rajasthan aims for 90 GW of renewable energy b...,RENEWABLES,May2024,0
11836,Radiance to raise $150 million for 2GW expansi...,RENEWABLES,May2024,0


In [1260]:
df[df['Sector'] == 'GREEN ENERGY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
146,Going net-zero,GREEN ENERGY,September23,0
601,India well-placed to be hydrogen exporter to d...,GREEN ENERGY,September23,0
724,India has bright growth prospects-Morgan Stanley,GREEN ENERGY,September23,0
1886,"The green shift: India embraces challenges, ta...",GREEN ENERGY,October 23,0
1889,India-Sweden Innovation Day 2023: Fostering Su...,GREEN ENERGY,October 23,0


In [1261]:
df[df['Sector'] == 'ENERGY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
9,"Govt unveils Green Hydrogen standards, sets em...",ENERGY,August23,0
125,New gas discovery in prolific Indian west coas...,ENERGY,August23,0
126,Crude rallies on expectations of tightening su...,ENERGY,August23,0
127,Advancing India's 'Nett Zero' mission: Conclav...,ENERGY,August23,0
132,NLC India inks pact to supply 300 MW solar pow...,ENERGY,August23,0
134,Adani Greens Mundra Solar Energy Gets Approval...,ENERGY,August23,0
143,EXPLAINER: India’s Green Hydrogen Standard,ENERGY,September23,0
225,India's Net-Zero Transition Offers $12.7 Trill...,ENERGY,September23,0
253,"India, EU differ on ICAO’s green fuel framework",ENERGY,September23,0
257,Blue Energy Motors bags green truck contract f...,ENERGY,September23,0


In [1262]:
# RENEWABLES
def renewables_sector(text):
    text=re.sub(r'\w*WABLE\w*\s*\w*','RENEWABLES',text)
    return text

df['Sector']=df['Sector'].apply(renewables_sector)

In [1263]:
len(df.Sector.unique())

110

### GREEN ENERGY

In [1264]:
# GREEN ENERGY
def energy_sector(text):
    text=re.sub(r'\w*\s*\w*ENERGY','GREEN ENERGY',text)
    return text

df['Sector']=df['Sector'].apply(energy_sector)

In [1265]:
len(df.Sector.unique())

109

### HOTEL

In [1266]:
df[df['Sector'] == 'HOTEL']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
30,Hotel industry's contribution to India's GDP t...,HOTEL,August23,0
34,Domestic hotel industry estimated to contribut...,HOTEL,August23,0
35,Hotel industry’s contribution to India’s GDP t...,HOTEL,August23,0
36,‘Hotel industry to contribute $1.5 tn to GDP b...,HOTEL,August23,0
40,"State hikes min wages by Rs 100 for mfg, const...",HOTEL,August23,0
...,...,...,...,...
11426,Marriott Opens 150th Hotel in Katra,HOTEL,May2024,0
11782,Radisson Hotel Group unveils Mandrem Beach Res...,HOTEL,May2024,0
11784,Indian Hotels Company Ltd likely to open more ...,HOTEL,May2024,0
11786,"Katra, Kasauli, Kashmir: Five-star hotels expa...",HOTEL,May2024,0


In [1267]:
df[df['Sector'] == 'HOTELS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
2312,High on hotels: A gold rush starts as business...,HOTELS,October 23,0


In [1268]:
# HOTEL
def hotel_sector(text):
    text=re.sub(r'HOTELS','HOTEL',text)
    return text

df['Sector']=df['Sector'].apply(hotel_sector)

In [1269]:
len(df.Sector.unique())

108

### RETAIL

In [1270]:
df[df['Sector'] == 'RETAL']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
33,Retail sales up 9% in July: Retailers Associat...,RETAL,August23,0


In [1271]:
df[df['Sector'] == 'RETAIL']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1094,FADA’s 5th Auto Retail Conclave resolves to ‘C...,RETAIL,September23,0
1156,Indian retail industry to grow at 10 pc CAGR f...,RETAIL,September23,0
1510,"UP Govt to set up warehouses, Cargo terminals ...",RETAIL,October 23,1
1520,"Diesel sales fall in India, petrol goes up by ...",RETAIL,October 23,1
1576,New report shows Airbnb contributed over INR 7...,RETAIL,October 23,1
...,...,...,...,...
5354,Adani Ports may shut Krishnapatnam box termina...,RETAIL,Dec23 Shortlist,1
5355,New superhighways in India are connecting citi...,RETAIL,Dec23 Shortlist,1
5356,Government proposes 19.5 lakh crore national h...,RETAIL,Dec23 Shortlist,1
5369,Visakhapatnam port zips past 50-MT cargo handl...,RETAIL,Dec23 Shortlist,0


In [1272]:
# RETAIL
def retail_sector(text):
    text=re.sub(r'RETAL','RETAIL',text)
    return text

df['Sector']=df['Sector'].apply(retail_sector)

In [1273]:
len(df.Sector.unique())

107

### TRAVEL

In [1274]:
df[df['Sector'] == 'TRAVEL']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
43,Over 20% uptick in premium lodging rates thank...,TRAVEL,August23,0
652,Digital Transformation of an Indian Travel Com...,TRAVEL,September23,0
654,India sees a 106% uptick in inbound travel in ...,TRAVEL,September23,0
655,G20 Summit Brings Back Global Attention To Ind...,TRAVEL,September23,0
656,Impact of G20 Summit on India’s Travel Industry,TRAVEL,September23,0
1089,Indian Navy Signs MOU With Uber For Private Tr...,TRAVEL,September23,0
1119,G20 summit will boost India's travel and touri...,TRAVEL,September23,0
1141,Introducing Patra Travels: A Game Changer in t...,TRAVEL,September23,0
1584,ICC World Cup is a massive boost for the trave...,TRAVEL,October 23,0
1586,Uttarakhand: No Inner-line permit required to ...,TRAVEL,October 23,0


In [1275]:
df[df['Sector'] == 'THRAVEL']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
11790,"Indian outbound travel surges, Vietnam leads d...",THRAVEL,May2024,0
11792,India's travel boom: Record number of Indians ...,THRAVEL,May2024,0


In [1276]:
# TRAVEL
def travel_sector(text):
    text=re.sub(r'THRAVEL','TRAVEL',text)
    return text

df['Sector']=df['Sector'].apply(travel_sector)

In [1277]:
len(df.Sector.unique())

106

### MINING

In [1278]:
df[df['Sector'] == 'MINING']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
51,"Mining, Oil & Gas sector critical for sustaina...",MINING,August23,0
74,Indian Mining and Metals Sector Quest for Rene...,MINING,August23,0
179,India’s SpaceTech transformation is credited t...,MINING,September23,0
180,India asks US to release funds frozen over sus...,MINING,September23,0
735,IIT-Madras establishes Chair to explore urban ...,MINING,September23,0
...,...,...,...,...
11108,India gears up for critical minerals expansion...,MINING,May2024,0
11109,India to become leader in offshore mining as g...,MINING,May2024,0
11111,Mineral production index soars in FY 2023-24: ...,MINING,May2024,0
11699,"Mines ministry asks Coal India, NMDC to look f...",MINING,May2024,0


In [1279]:
df[df['Sector'] == 'MINIG']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
11447,BLECH India gears up for most anticipated shee...,MINIG,May2024,0


In [1280]:
# MINING
def mining_sector(text):
    text=re.sub(r'MINI\w+','MINING',text)
    return text

df['Sector']=df['Sector'].apply(mining_sector)

In [1281]:
len(df.Sector.unique())

105

In [1282]:
df[df['Sector'] == 'PETRO']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1594,"Oil prices ‘too high’, India calls for higher ...",PETRO,October 23,0


### PETCHEM

In [1283]:
df[df['Sector'] == 'PETCHEM']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
2394,Gulf Oil and S-OIL SEVEN join forces to expand...,PETCHEM,October 23,0


In [1284]:
df[df['Sector'] == 'PETCHEEM']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
161,"PLI likely for chemicals, not petchem | Mint",PETCHEEM,September23,0


In [1285]:
df[df['Sector'] == 'PETROCHEM']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
168,Betta Tank Robotics to Expand Portfolio of Rob...,PETROCHEM,September23,0
2400,How A 50% Plunge In Benchmark Refining Margin ...,PETROCHEM,October 23,0
5775,Duty concessions on petrochemical products key...,PETROCHEM,January24,0


In [1286]:
# PETCHEM
def petchem_sector(text):
    text=re.sub(r'PET\w*CHE\w*M','PETCHEM',text)
    return text

df['Sector']=df['Sector'].apply(petchem_sector)

In [1287]:
len(df.Sector.unique())

103

### BIOFUEL

In [1288]:
df[df['Sector'] == 'BIOGAS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
77,"India installs more than 11,000 small biogas p...",BIOGAS,August23,0
119,Kochi Corporation Council to decide on crucial...,BIOGAS,August23,0
223,Chennai to get waste-to-energy plant,BIOGAS,September23,0
227,New bio-waste unit in Andhra Pradesh to dispos...,BIOGAS,September23,0
723,Bengaluru to get 4 biogas plants in 4-5 months,BIOGAS,September23,0
2019,Global Biofuel Alliance guidelines expected in...,BIOGAS,October 23,0
2024,"By 2041, Delhi may generate over 19,000 tonnes...",BIOGAS,October 23,0
2026,"New investments worth Rs 2,755 cr pledged with...",BIOGAS,October 23,0
3137,More compressed biogas plants to become operat...,BIOGAS,November 23,0
3507,How to be smarter about methane slip – right now,BIOGAS,November 23,0


In [1289]:
df[df['Sector'] == 'BIOFUEL']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
87,"Indian, Russian scientists find way to extract...",BIOFUEL,August23,0
757,G20: India's Global Biofuel Alliance initiativ...,BIOFUEL,September23,0
758,India launches Global Biofuel Alliance at G20:...,BIOFUEL,September23,0
1887,Benefits India & world would derive from Globa...,BIOFUEL,October 23,0
2316,"From autonomous technology to biofuels, what a...",BIOFUEL,October 23,0


In [1290]:
# BIOFUEL
def biofuel_sector(text):
    text=re.sub(r'BIOGAS','BIOFUEL',text)
    return text

df['Sector']=df['Sector'].apply(biofuel_sector)

In [1291]:
len(df.Sector.unique())

102

### ADHESIVES

In [1292]:
df[df['Sector'] == 'ADHESIVE']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1122,India's Odisha state approves IOCL's polyester...,ADHESIVE,September23,0


In [1293]:
df[df['Sector'] == 'ADHESIVES']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
84,"Positive Breakout: GMM Pfaudler, Page Industri...",ADHESIVES,August23,0
85,Boutique bakery industry gaining ground across...,ADHESIVES,August23,0
86,CoutLoot Aims to Dominate India's $300 bn Wort...,ADHESIVES,August23,0
753,Contact Adhesives Market Analysis Report 2023-...,ADHESIVES,September23,0
754,Wood Adhesives Market 2023-2030 | Surviving th...,ADHESIVES,September23,0
755,New sustainable label adhesive at Labelexpo fr...,ADHESIVES,September23,0
1088,Genetically modified bacteria degrade plastics...,ADHESIVES,September23,0
1183,Indian companies at the forefront at Labelexpo...,ADHESIVES,September23,0
1184,New products by Brilliant Polymers at Elite Sp...,ADHESIVES,September23,0
1637,Bostik Showcases New Adhesives Range at The In...,ADHESIVES,October 23,0


In [1294]:
# ADHESIVES
def adhesives_sector(text):
    text=re.sub(r'ADHESIVE\b','ADHESIVES',text)
    return text

df['Sector']=df['Sector'].apply(adhesives_sector)

In [1295]:
len(df.Sector.unique())

101

### DETERGENTS

In [1296]:
df[df['Sector'] == 'DETERGENTS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
761,Why Gujarati business families taste success,DETERGENTS,September23,0
1185,"Household, personal care sector to yield addit...",DETERGENTS,September23,0
2455,Uflex's QSR and takeaway packaging solutions,DETERGENTS,October 23,0
2456,Hindustan Unilever bets on core brands for growth,DETERGENTS,October 23,0
2458,Soaked in water scarcity: Dhobi Ghats hung out...,DETERGENTS,October 23,0
11174,Tata Chemicals Q4 results: Net loss at Rs 850 ...,DETERGENTS,May2024,0
11502,Rural FMCG demand outpaces urban for 1st time ...,DETERGENTS,May2024,0


In [1297]:
df[df['Sector'] == 'DETERGENT']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
90,Water Soluble Films Market projected to grow a...,DETERGENT,August23,0
91,"Sea6 Energy: Farming the ocean, and exploring ...",DETERGENT,August23,0
136,ChrysCapital still in race to buy controlling ...,DETERGENT,August23,0
155,2023 Detergents Market Size and Trends| Market...,DETERGENT,September23,0
1639,Portable Dishwasher Market Expected to Achieve...,DETERGENT,October 23,0
2457,Rural recovery for FMCG companies takes a paus...,DETERGENT,October 23,0
3154,Rural recovery for FMCG companies takes a paus...,DETERGENT,November 23,0
3155,Small FMCG brands get bigger as inflation cools,DETERGENT,November 23,0
10503,Plastic Packaging Eliminated With New Tide Det...,DETERGENT,April2024,0


In [1298]:
df[df['Sector'] == 'DETRGENTS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
2713,FMCG Ebitda margins going strong despite slow ...,DETRGENTS,November 23,0


In [1299]:
# DETERGENTS
def detergents_sector(text):
    text=re.sub(r'DET\w*RGENT\w*','DETERGENTS',text)
    return text

df['Sector']=df['Sector'].apply(detergents_sector)

In [1300]:
len(df.Sector.unique())

99

In [1301]:
df[df['Sector'] == 'PORT']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
94,"India To Build 5 Fleet Support Ships Worth 20,...",PORT,August23,0
117,New Mangalore Port Authority’s profit expected...,PORT,August23,0
122,INCOIS launches ‘SAMUDRA’ mobile app for seafa...,PORT,August23,0
149,India's Deendayal Port and DP World Create ₹4....,PORT,September23,0
182,India allows exports of non-basmati white rice...,PORT,September23,0
...,...,...,...,...
11759,India Invests $370 Million In 10-Year Deal To ...,PORT,May2024,0
11760,"Indian Coast Guard, Hindalco tie-up for indige...",PORT,May2024,0
11810,India eyes more Chabahar-like pacts: Ports Min...,PORT,May2024,0
11812,"India, Iran sign 10-year contract for Chabahar...",PORT,May2024,0


### TELECOM

In [1302]:
df[df['Sector'] == 'TELCOM']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
9001,Indian telecom industry’s revenue grew by Rs 1...,TELCOM,March2024,0
9002,Spectrum auction may see muted demand,TELCOM,March2024,0
9003,Cases of equipment thefts at all-time high; CO...,TELCOM,March2024,0
9004,Govt introduces spectrum regulatory sandbox fo...,TELCOM,March2024,0
9005,"India, US, South Korea explore cooperation in ...",TELCOM,March2024,0
9006,Vi eyes 40% of revenues from 5G in over two years,TELCOM,March2024,0
9007,"Airtel, Jio hand out handset offers for 5G tak...",TELCOM,March2024,0
11805,Telecom secretary pushes high-speed data for s...,TELCOM,May2024,0
11806,68% Indian firms rely on tech to drive sustain...,TELCOM,May2024,0
11807,Prototype promising sustainable energy future ...,TELCOM,May2024,0


In [1303]:
df[df['Sector'] == 'TELECOM']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
97,Telecom services industry revenue to see 7-9 p...,TELECOM,August23,0
98,Trai exploring auction models to 'best' alloca...,TELECOM,August23,0
99,Cyber crime wing to use CEIR portal for tracin...,TELECOM,August23,0
100,Government will ensure orderly transition to n...,TELECOM,August23,0
101,"Telcos bat for 6 GHz spectrum band for 5G, 6G ...",TELECOM,August23,0
...,...,...,...,...
11481,Bacancy Successfully Manufactures India's Firs...,TELECOM,May2024,0
11829,World Telecom Day: India has 99% coverage with...,TELECOM,May2024,0
11830,Trai seeks inputs for broadcasting policy that...,TELECOM,May2024,0
11833,"Vi's Rs 18,000-cr fundraise will increase comp...",TELECOM,May2024,0


In [1304]:
# TELECOM
def telecom_sector(text):
    text=re.sub(r'TELCOM','TELECOM',text)
    return text

df['Sector']=df['Sector'].apply(telecom_sector)

In [1305]:
len(df.Sector.unique())

98

### HIGHWAY

In [1306]:
df[df['Sector'] == 'NHAI']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
110,CAG raises concerns over delegation of powers ...,NHAI,August23,0
111,"Assam, NHAI ink pact for increasing green cover",NHAI,August23,0
112,Funds allotted for NH-66 development in Kollam,NHAI,August23,0
113,Ensure early restoration of highways in Himach...,NHAI,August23,0
114,"NHAI, GAIL join forces to simplify infrastruct...",NHAI,August23,0
280,NHAI invites tender for constructing two vehic...,NHAI,September23,0
1076,Work on Dwarka expressway tunnel connecting Ha...,NHAI,September23,0
1078,Collect only 50% toll fee at Vagaikulam: HC to...,NHAI,September23,0
3876,NHAI begins work to upgrade service roads alon...,NHAI,December 23,0
3877,Keonjhar mishap: Odisha transport department p...,NHAI,December 23,0


In [1307]:
df[df['Sector'] == 'HIGHWAY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
115,Greenfield highway: takeover of land to be com...,HIGHWAY,August23,0
273,Odisha coastal highway Package 3 gets green cl...,HIGHWAY,September23,0
279,28% fall in fatalities on Muz-Kotwa NH-28,HIGHWAY,September23,0
285,Slow progress in highway construction persists...,HIGHWAY,September23,0
711,Bengaluru-Chennai express highway to start by ...,HIGHWAY,September23,0
...,...,...,...,...
9023,Nitin Gadkari lays foundation for 22 NH projec...,HIGHWAY,March2024,0
9413,NHAI monetises 889 km of national highways to ...,HIGHWAY,March2024,0
9849,Connectivity achieved by BRO for third axis to...,HIGHWAY,April2024,0
10505,Govt likely to change construction norm to ‘la...,HIGHWAY,April2024,0


In [1308]:
df[df['Sector'] == 'HIGHWAYS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1612,Revenue Roadmap: India Sets Sights On Monetisi...,HIGHWAYS,October 23,0
4201,Road Ministry demands 25% hike in budget alloc...,HIGHWAYS,December 23,0
4202,"NH totalling 43,856 km under construction stag...",HIGHWAYS,December 23,0
4203,Expansion of 20 national highways in Odisha pe...,HIGHWAYS,December 23,0
4576,Highways in India: 9 times increase in budget ...,HIGHWAYS,December 23,0
...,...,...,...,...
11799,Work on road connecting YXP & EPE stops as co ...,HIGHWAYS,May2024,0
11800,Kalunga ROB project nears completion; operatio...,HIGHWAYS,May2024,0
11802,Highway Ministry Front-Loads 20% Capex,HIGHWAYS,May2024,0
11803,NHAI secures 164 insurance bonds for road proj...,HIGHWAYS,May2024,0


In [1309]:
# HIGHWAY
def highway_sector(text):
    text=re.sub(r'NHAI','HIGHWAY',text)
    text=re.sub(r'HIGHWAYS','HIGHWAY',text)
    return text

df['Sector']=df['Sector'].apply(highway_sector)

In [1310]:
len(df.Sector.unique())

96

### PETROLEUM

In [1311]:
df[df['Sector'] == 'CRUDE']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
150,"Govt cuts windfall tax on domestic crude, hike...",CRUDE,September23,0
2396,India slashes windfall tax on domestic crude f...,CRUDE,October 23,0


In [1312]:
df[df['Sector'] == 'PETROLEUM']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
52,Govt hikes windfall tax on petroleum crude to ...,PETROLEUM,August23,0
145,As rains dampen India’s domestic petroleum dem...,PETROLEUM,September23,0
3442,The Problem With Refilling The Strategic Petro...,PETROLEUM,November 23,0
6488,Indian Institute of Petroleum and Energy inks ...,PETROLEUM,January24,0
6492,State-run oil companies surge ahead with Rs 89...,PETROLEUM,January24,0
6493,Govt cuts windfall tax on petroleum crude to R...,PETROLEUM,January24,0
6494,Punjab petroleum dealers demand increase in ma...,PETROLEUM,January24,0
7823,Confidence Petroleum India JV with BW LPG to b...,PETROLEUM,February2024,0


In [1313]:
# PETROLEUM
def petroleum_sector(text):
    text=re.sub(r'CRUDE','PETROLEUM',text)
    return text

df['Sector']=df['Sector'].apply(petroleum_sector)

In [1314]:
len(df.Sector.unique())

95

### SUSTAINABILITY

In [1315]:
df[df['Sector'] == 'SUSTAINABILITY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
156,Transition finance can balance profitability w...,SUSTAINABILITY,September23,0
200,‘Bridge elements’ – A necessity in India’s dec...,SUSTAINABILITY,September23,0
228,Electrifying last-mile deliveries for a more s...,SUSTAINABILITY,September23,0
738,India Inc rides EV wave in sustainability push,SUSTAINABILITY,September23,0
739,‘India can emerge as refuelling destination fo...,SUSTAINABILITY,September23,0
759,G20 Achieves Green Development Pact For Global...,SUSTAINABILITY,September23,0
760,Meet the sustainability champions of Visakhapa...,SUSTAINABILITY,September23,0
1123,"Net zero target: Oil India to invest Rs 25,000...",SUSTAINABILITY,September23,0
1207,Developing sustainable mobility ecosystem in h...,SUSTAINABILITY,September23,0
2014,India to push developed nations to become 'car...,SUSTAINABILITY,October 23,0


In [1316]:
df[df['Sector'] == 'SUSTAINABLE']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
2666,What does it take to decarbonise India's indus...,SUSTAINABLE,November 23,0


In [1317]:
# SUSTAINABILITY
def sustainability_sector(text):
    text=re.sub(r'SUSTAINABLE','SUSTAINABILITY',text)
    return text

df['Sector']=df['Sector'].apply(sustainability_sector)

In [1318]:
len(df.Sector.unique())

94

### TECHNOLOGY

In [1319]:
df[df['Sector'] == 'TECH']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1924,Cabinet approves India's digital tech pacts wi...,TECH,October 23,0


In [1320]:
df[df['Sector'] == 'TEECH']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
212,"Integrated digitalisation: Reduces costs, addr...",TEECH,September23,0


In [1321]:
df[df['Sector'] == 'TECHNOLOGY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1506,How AI is helping organisations build future-r...,TECHNOLOGY,October 23,1
1890,Cutting-edge technologies are revolutionising ...,TECHNOLOGY,October 23,1
1904,Farmer collectives off to a quick start on OND...,TECHNOLOGY,October 23,1
1927,India's navigation system means business: NavI...,TECHNOLOGY,October 23,1
2227,Driving into the future: the impact of connect...,TECHNOLOGY,October 23,1
2419,Tata Motors picks up nearly 27% in Freight Tig...,TECHNOLOGY,October 23,1
2502,HPCL and Zupple Labs collab to develop blockch...,TECHNOLOGY,October 23,1
2560,Can AI-powered voice payments for UPI transact...,TECHNOLOGY,October 23,1
5279,"In two years, 70% smartphone users may switch ...",TECHNOLOGY,Dec23 Shortlist,1
5319,Mappls KOGO and Zoomcar partner to elevate tra...,TECHNOLOGY,Dec23 Shortlist,1


In [1322]:
# TECHNOLOGY
def technology_sector(text):
    text=re.sub(r'TE\w*CH\b','TECHNOLOGY',text)
    return text

df['Sector']=df['Sector'].apply(technology_sector)

In [1323]:
len(df.Sector.unique())

92

### PAINT

In [1324]:
df[df['Sector'] == 'PAINT']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
262,Paint industry evolves: New players and acquis...,PAINT,September23,0
1907,"OMCs, paint and tyre stocks slip as crude oil ...",PAINT,October 23,0
2332,Nippon Paint India plans to increase auto refi...,PAINT,October 23,0
2333,India's Grasim to raise $481 mln to pay down d...,PAINT,October 23,0
2582,"Asian Paints, Berger, Nerolac: Morgan Stanley ...",PAINT,November 23,0
3007,Asian Paints’ net up 54% despite flat revenue ...,PAINT,November 23,0
3037,Nippon Paint India plans to increase auto refi...,PAINT,November 23,0
3400,Insulated Paint Market: Driving Sustainability...,PAINT,November 23,0
4524,Nippon Paint India acquires Vibgyor Paints and...,PAINT,December 23,1
4525,Nippon Paint announces acquisition of VIBGYOR ...,PAINT,December 23,0


In [1325]:
df[df['Sector'] == 'PAINTS']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1170,Grasim to launch paints biz by early next yr |...,PAINTS,September23,0
1171,Grasim eyes paint business entry in Q4 to step...,PAINTS,September23,0
1172,Paint Industry Shades Of Success,PAINTS,September23,0
11747,Berger Paints India Q4 net profit grows 19.68 ...,PAINTS,May2024,0
11748,Berger Paints records nearly 20% YoY rise in Q...,PAINTS,May2024,0


In [1326]:
# PAINT
def paint_sector(text):
    text=re.sub(r'PAINTS','PAINT',text)
    return text

df['Sector']=df['Sector'].apply(paint_sector)

In [1327]:
len(df.Sector.unique())

91

In [1328]:
df[df['Sector'] == 'CORE ']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
8251,Growth rate of eight core sectors slows down t...,CORE,March2024,0


In [1329]:
df[df['Sector'] == 'MOBILITY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
295,Rideshares no silver bullet for urban India’s ...,MOBILITY,September23,0
602,Open mobility network to be expanded,MOBILITY,September23,0
2581,Building last-mile electrification for sustain...,MOBILITY,November 23,0
3151,Three steps to sustainable mobility,MOBILITY,November 23,0
3512,The potential and the potholes in India’s jour...,MOBILITY,November 23,0
7045,Euler Motors & Magenta Mobility Scale Up Partn...,MOBILITY,February2024,0


In [1330]:
df[df['Sector'] == 'ELECTRIC MOBILITY']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
1572,Jharkhand's lithium finds all set to power Ind...,ELECTRIC MOBILITY,October 23,1
1640,India to start auction of critical mineral min...,ELECTRIC MOBILITY,October 23,1
1901,Li-ion battery recycling: booming market with ...,ELECTRIC MOBILITY,October 23,1
2553,India's EV revolution: Dashboard predicts 45.5...,ELECTRIC MOBILITY,October 23,1
2554,Free-for-all EV-Ready India dashboard launched...,ELECTRIC MOBILITY,October 23,1
2555,Indraprastha Gas Ltd stock nosedives on New EV...,ELECTRIC MOBILITY,October 23,1
2556,"Fuelled by EV prospects, Indian auto sector se...",ELECTRIC MOBILITY,October 23,1
2557,SUN Mobility sets sights on the heavy-duty veh...,ELECTRIC MOBILITY,October 23,1
2558,Tata Power runs EV charging show for fans pull...,ELECTRIC MOBILITY,October 23,1


In [1331]:
df[df['Sector'] == 'EV']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
4,Andhra Pradesh's Avera making waves on global ...,EV,August23,0
93,Bharat Charge Alliance and CHAdeMO Association...,EV,August23,0
160,JSW Group in talks with Chinese carmaker Leap ...,EV,September23,0
231,Formulation of a safety ecosystem for growing ...,EV,September23,0
275,"Govt to stagger ₹57,613 cr e-bus scheme invest...",EV,September23,0
287,Nextev policy to focus on enhancing charging i...,EV,September23,0
719,The Chaos In India's EV Charging Ecosystem,EV,September23,0
1193,"Pune, Pimpri Chinchwad register 30% of all EV ...",EV,September23,0
1194,Govt gears up to introduce revised version of ...,EV,September23,0
1967,BYD execs fail to get Indian visas | Mint,EV,October 23,0


### FREIGHT

In [1332]:
df[df['Sector'] == 'FIREIGHT']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
11436,Rail freight grows at 1.4% in April; coal carg...,FIREIGHT,May2024,0


In [1333]:
df[df['Sector'] == 'FREIGHT CORRIDOR']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
668,The Middle-East trade-tech corridor will bring...,FREIGHT CORRIDOR,September23,0
722,Dedicated freight corridor 65% operational; to...,FREIGHT CORRIDOR,September23,0
2632,"4 girders weighing 1,548 tonnes launched on DF...",FREIGHT CORRIDOR,November 23,0
2633,"Engie India to invest Rs 3,500 cr for 700-MW r...",FREIGHT CORRIDOR,November 23,0
2637,India commences operations on Eastern Freight ...,FREIGHT CORRIDOR,November 23,1
3485,DFCCIL achieves new milestone with 3rd tunnel ...,FREIGHT CORRIDOR,November 23,0
3486,Dedicated Freight Corridor Corp reveals five s...,FREIGHT CORRIDOR,November 23,0
3487,DFCCIL conducts electric loco trial on 27 km t...,FREIGHT CORRIDOR,November 23,0
3488,Why IMEC shouldn't be a missed opportunity,FREIGHT CORRIDOR,November 23,0
5001,Modi dedicates Pandit Deen Dayal Upadhyay-Bhau...,FREIGHT CORRIDOR,December 23,0


In [1334]:
# FREIGHT
def freight_sector(text):
    text=re.sub(r'F\w*IGHT\s*\w*','FREIGHT',text)
    return text

df['Sector']=df['Sector'].apply(freight_sector)

In [1335]:
len(df.Sector.unique())

90
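
The substitution patterns in this section are unanchored, so `re.sub` rewrites any substring that matches. The sketch below, which uses only the standard `re` module, shows how the freight pattern behaves on a few labels. `FLIGHT` is a made-up label included only to illustrate the over-matching risk, and `normalize_freight` is a hypothetical stricter alternative, not a function from this notebook.

```python
import re

# Pattern used in freight_sector above (unanchored)
LOOSE_FREIGHT = re.compile(r'F\w*IGHT\s*\w*')

# Hypothetical stricter variant: list the observed spellings and match the whole label
STRICT_FREIGHT = re.compile(r'(?:FIREIGHT|FREIGHT(?: CORRIDOR)?)')

def normalize_freight(label):
    """Return 'FREIGHT' only when the entire label is a known freight variant."""
    return 'FREIGHT' if STRICT_FREIGHT.fullmatch(label) else label

for label in ['FIREIGHT', 'FREIGHT CORRIDOR', 'FLIGHT']:
    # Left: result of the loose pattern, right: result of the strict helper
    print(label, '->', LOOSE_FREIGHT.sub('FREIGHT', label), '|', normalize_freight(label))
```

The loose pattern would turn the made-up `FLIGHT` label into `FREIGHT`, while the strict helper leaves it unchanged. Listing the known variants costs a little more typing but makes each merge explicit.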

### PRODUCTION

In [1336]:
df[df['Sector'] == 'PROCTION']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
4958,How India Is Emerging As A Production Hub For ...,PROCTION,December 23,0


In [1337]:
df[df['Sector'] == 'PRODUCTION']

Unnamed: 0,Links,Sector,Sheet,Shortlisted
4549,Factory output: IIP rises to 16-month high of ...,PRODUCTION,December 23,0


In [1338]:
# PRODUCTION
def production_sector(text):
    text=re.sub(r'PROCTION','PRODUCTION',text)
    return text

df['Sector']=df['Sector'].apply(production_sector)

In [1339]:
len(df.Sector.unique())

89
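
The per-sector functions above all follow the same shape: one or two `re.sub` calls followed by `apply`. As a possible consolidation, the sketch below collects the patterns from the DETERGENTS to PRODUCTION cells into one rule table. `SECTOR_RULES` and `normalize_sector` are hypothetical names, and only the rules visible in these cells are included.

```python
import re

# (pattern, canonical label) pairs copied from the per-sector functions above
SECTOR_RULES = [
    (re.compile(r'DET\w*RGENT\w*'), 'DETERGENTS'),
    (re.compile(r'TELCOM'), 'TELECOM'),
    (re.compile(r'NHAI|HIGHWAYS'), 'HIGHWAY'),
    (re.compile(r'CRUDE'), 'PETROLEUM'),
    (re.compile(r'SUSTAINABLE'), 'SUSTAINABILITY'),
    (re.compile(r'TE\w*CH\b'), 'TECHNOLOGY'),
    (re.compile(r'PAINTS'), 'PAINT'),
    (re.compile(r'F\w*IGHT\s*\w*'), 'FREIGHT'),
    (re.compile(r'PROCTION'), 'PRODUCTION'),
]

def normalize_sector(text):
    """Apply every rule in order, mirroring the chained re.sub calls above."""
    for pattern, canonical in SECTOR_RULES:
        text = pattern.sub(canonical, text)
    return text

for label in ['DETRGENTS', 'NHAI', 'TEECH', 'TECHNOLOGY', 'FREIGHT CORRIDOR', 'PORT']:
    print(label, '->', normalize_sector(label))
```

In the notebook this could replace the separate cells with a single `df['Sector'] = df['Sector'].apply(normalize_sector)`, assuming the column holds only strings at that point.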

In [1340]:
df.Sector.unique()

array(['TRANSPORT', 'LOGISTICS', 'EV', 'INFRASTRUCTURE', 'TRADE',
       'ELECTRONICS', 'GREEN ENERGY', 'AGRICULTURE', 'TOURISM', 'PHARMA',
       'FOOD', 'HOTEL', 'RETAIL', 'HOSPITALITY', 'AVIATION', 'TRAVEL',
       'RENEWABLES', 'MINING', 'PETROLEUM', 'AIRPORT', 'COAL', 'BIOFUEL',
       'STEEL', 'ADHESIVES', 'DETERGENTS', 'PORT', 'CONSTRUCTION',
       'TELECOM', 'HIGHWAY', 'ROAD', 'BUSINESS', 'MARINE', 'POWER',
       'INFRASTRUCTURE ', 'COOKING', 'ECONOMY', 'FMCG', 'FUEL',
       'SUSTAINABILITY', 'PETCHEM', 'OIL', 'POLICY', 'TECHNOLOGY',
       'MARKET', 'AUTO', '', 'MANUFACTURING', 'PAINT', 'CORRIDOR',
       'MOBILITY', 'SUGAR', 'CARBON EMISSION', 'HYDROCARBON', 'FREIGHT',
       'OMC', 'SHIPPING', 'BUZZ', 'GDP', 'NATURAL GAS',
       'ELECTRIC MOBILITY', 'COMPETITOR', 'PETRO', 'I&C', 'RETAIL & I&C',
       'OIL & GAS', 'LPG', 'IT', 'HYDROGEN', 'PARTING SHOTS',
       'LUBRICANTS', 'GAS', 'INDUSTRY', 'HUB', 'VEHICLES', 'AIRLINE',
       'PRODUCTION', 'EV/RETAIL', 'RETAIL, LPG'

### Assigning multiple sectors

In [1341]:
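def renaming_multiple_sectors(text):
    # Collapse composite labels such as 'RETAIL & I&C', 'EV/RETAIL' or 'OIL & GAS' into a single sector.
    # ' & ' is matched with its surrounding spaces so that 'I&C' is left untouched.
    if text!='' and (' & ' in text or ', ' in text or '/' in text):
        if 'RETAIL' in text:
            text='RETAIL'
        elif 'COMPETITOR' in text:
            text='COMPETITOR'
        else:
            text='OIL'
    return text

df['Sector']=df['Sector'].apply(renaming_multiple_sectors)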

In [1342]:
def checking_split(text):
    if text!='' and (',' in text or '&' in text or '/' in text):
        print(text)
    return

df.Sector.apply(checking_split)

I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C
I&C


0        None
1        None
2        None
3        None
4        None
         ... 
12048    None
12049    None
12050    None
12051    None
12052    None
Name: Sector, Length: 12008, dtype: object
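
`checking_split` prints each match and returns `None`, which is why the cell output ends with a Series of `None` values. The sketch below is a variant that collects the distinct composite labels instead. `find_composite_labels` is a hypothetical helper, shown on a plain list so that it runs without pandas.

```python
def find_composite_labels(labels):
    """Return the distinct labels that still contain ',', '&' or '/'."""
    separators = (',', '&', '/')
    return sorted({
        label for label in labels
        if isinstance(label, str) and any(sep in label for sep in separators)
    })

# Illustrative sample; in the notebook the input would be the Sector column
sample = ['RETAIL', 'I&C', 'I&C', 'OIL', '', 'TELECOM']
print(find_composite_labels(sample))  # → ['I&C']
```

Because iterating over a pandas Series yields its values, `find_composite_labels(df['Sector'])` should work unchanged in the notebook.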

In [1346]:
import numpy as np
df.Sector=df.Sector.replace('', np.nan)          # Assign empty strings back to NaN (literal match, no regex needed)

In [1347]:
df.Sector.value_counts()

Sector
TELECOM            279
INFRASTRUCTURE     266
AVIATION           259
ELECTRONICS        238
LOGISTICS          236
                  ... 
PETRO                1
GDP                  1
BUZZ                 1
CARBON EMISSION      1
CHEMICAL             1
Name: count, Length: 80, dtype: int64

In [1348]:
df.Sector.unique()

array(['TRANSPORT', 'LOGISTICS', 'EV', 'INFRASTRUCTURE', 'TRADE',
       'ELECTRONICS', 'GREEN ENERGY', 'AGRICULTURE', 'TOURISM', 'PHARMA',
       'FOOD', 'HOTEL', 'RETAIL', 'HOSPITALITY', 'AVIATION', 'TRAVEL',
       'RENEWABLES', 'MINING', 'PETROLEUM', 'AIRPORT', 'COAL', 'BIOFUEL',
       'STEEL', 'ADHESIVES', 'DETERGENTS', 'PORT', 'CONSTRUCTION',
       'TELECOM', 'HIGHWAY', 'ROAD', 'BUSINESS', 'MARINE', 'POWER',
       'INFRASTRUCTURE ', 'COOKING', 'ECONOMY', 'FMCG', 'FUEL',
       'SUSTAINABILITY', 'PETCHEM', 'OIL', 'POLICY', 'TECHNOLOGY',
       'MARKET', 'AUTO', nan, 'MANUFACTURING', 'PAINT', 'CORRIDOR',
       'MOBILITY', 'SUGAR', 'CARBON EMISSION', 'HYDROCARBON', 'FREIGHT',
       'OMC', 'SHIPPING', 'BUZZ', 'GDP', 'NATURAL GAS',
       'ELECTRIC MOBILITY', 'COMPETITOR', 'PETRO', 'I&C', 'LPG', 'IT',
       'HYDROGEN', 'PARTING SHOTS', 'LUBRICANTS', 'GAS', 'INDUSTRY',
       'HUB', 'VEHICLES', 'AIRLINE', 'PRODUCTION', 'FLEET', 'DEFENCE',
       'ETHANOL', 'CORE ', 'DERIVATIVES',
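
The list above still contains `'INFRASTRUCTURE '` and `'CORE '` with a trailing space, so `'INFRASTRUCTURE '` is counted as a different sector from `'INFRASTRUCTURE'`. The sketch below shows one way to clean this up. `clean_whitespace` is a hypothetical helper, and `'OIL  &  GAS'` is a made-up input used only to show that internal runs of spaces are collapsed too.

```python
def clean_whitespace(label):
    """Trim the ends and collapse internal runs of whitespace; non-strings pass through."""
    if not isinstance(label, str):
        return label          # e.g. NaN for missing sectors
    return ' '.join(label.split())

labels = ['INFRASTRUCTURE', 'INFRASTRUCTURE ', 'CORE ', 'OIL  &  GAS']
cleaned = [clean_whitespace(label) for label in labels]
print(sorted(set(cleaned)))  # → ['CORE', 'INFRASTRUCTURE', 'OIL & GAS']
```

If only the ends need trimming, `df['Sector'].str.strip()` does the same job directly in pandas.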

# BERT (model implementation)

In [1349]:
# Hold out the April 2024 and May 2024 sheets as the test set
is_test=df['Sheet'].str.contains('April2024|May2024', regex=True)
train_df=df[~is_test]
test_df=df[is_test]          # select by mask rather than by position, so row order does not matter

In [1350]:
len(train_df)

9758

In [1351]:
len(test_df)

2250

In [1358]:
df.columns

Index(['Links', 'Sector', 'Sheet', 'Shortlisted', 'label'], dtype='object')

In [1377]:
df.Shortlisted=pd.to_numeric(df.Shortlisted)

In [1379]:
df.Shortlisted.value_counts()

Shortlisted
0    11608
1      400
Name: count, dtype: int64

In [1392]:
x_train,x_test,y_train,y_test=train_df.Links.tolist(),test_df.Links.tolist(),train_df.Sector.tolist(),test_df.Sector.tolist()

In [1372]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch

In [1395]:
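from sklearn.model_selection import train_test_split
labelled_df = train_df.dropna(subset=['Sector'])   # rows without a sector cannot serve as training labels
label_encoder = LabelEncoder()                     # BERT needs integer class ids, not sector names
train_texts, test_texts, train_labels, test_labels = train_test_split(labelled_df['Links'].tolist(), label_encoder.fit_transform(labelled_df['Sector']).tolist(), test_size=0.2)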

In [1396]:
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset
def tokenize_function(texts):
    return tokenizer(texts, padding='max_length', truncation=True, max_length=128)

train_encodings = tokenize_function(train_texts)
test_encodings = tokenize_function(test_texts)

In [1397]:
# Create a PyTorch Dataset
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NewsDataset(train_encodings, train_labels)
test_dataset = NewsDataset(test_encodings, test_labels)

In [1402]:
# Load pre-trained BERT model with a classification head (one output per distinct sector in the training data)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=train_df['Sector'].nunique())

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [1403]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    evaluation_strategy="epoch"
)

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

In [1400]:
# Initialize Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset             # evaluation dataset
)

NameError: name 'training_args' is not defined

In [1401]:
# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

# Save the fine-tuned model
model.save_pretrained('./fine-tuned-bert')
tokenizer.save_pretrained('./fine-tuned-bert')

NameError: name 'trainer' is not defined

## Noise reduction

### Lowercasing

In [155]:
df['Links']=df['Links'].str.lower()

In [156]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,india can usher in the beginning of the end of...,[TRANSPORT],August23,0
1,"volvo india to bid in govt’s e-bus programme, ...",[TRANSPORT],August23,0
2,optimizing warehouse space: how asrs maximizes...,[LOGISTICS],August23,0
3,bharat ncap to be launched by nitin gadkari ne...,[TRANSPORT],August23,0
4,andhra pradesh's avera making waves on global ...,[EV],August23,0


### Removing apostrophe s ('s)

In [157]:
# "'s" would negatively affect POS Tagging
df['Links'][1]

'volvo india to bid in govt’s e-bus programme, says company chief'

In [158]:
df['Links']=df['Links'].str.replace("'s", "")

In [159]:
df['Links']=df['Links'].str.replace("’s", "")

In [160]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,india can usher in the beginning of the end of...,[TRANSPORT],August23,0
1,"volvo india to bid in govt e-bus programme, sa...",[TRANSPORT],August23,0
2,optimizing warehouse space: how asrs maximizes...,[LOGISTICS],August23,0
3,bharat ncap to be launched by nitin gadkari ne...,[TRANSPORT],August23,0
4,andhra pradesh avera making waves on global e-...,[EV],August23,0
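
A plain `str.replace("'s", "")` removes the two characters wherever they occur, including at the start of a quoted word. The sketch below restricts the removal to a word-final possessive with a regular expression. `strip_possessive` is a hypothetical helper, and the second sample headline is made up to show the difference.

```python
import re

# Straight (') or curly (’) apostrophe followed by "s" at the end of a word
POSSESSIVE = re.compile(r"['’]s\b")

def strip_possessive(text):
    """Remove possessive 's without touching an apostrophe that opens a quoted word."""
    return POSSESSIVE.sub('', text)

print(strip_possessive("volvo india to bid in govt’s e-bus programme"))
# → volvo india to bid in govt e-bus programme
print(strip_possessive("india's 'smart' ports"))
# → india 'smart' ports
```

With the plain replacement, the second headline would lose the `'s` of `'smart'` as well. The pattern also handles both apostrophe styles in one pass.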


### Removing accented characters

In [161]:
from unidecode import unidecode
df['Links'] = df['Links'].apply(unidecode)

In [162]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,india can usher in the beginning of the end of...,[TRANSPORT],August23,0
1,"volvo india to bid in govt e-bus programme, sa...",[TRANSPORT],August23,0
2,optimizing warehouse space: how asrs maximizes...,[LOGISTICS],August23,0
3,bharat ncap to be launched by nitin gadkari ne...,[TRANSPORT],August23,0
4,andhra pradesh avera making waves on global e-...,[EV],August23,0


### Removing special characters

In [163]:
import string
exclude = string.punctuation
exclude

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [164]:
# Function to remove punctuation from text
def remove_sp(text):
    return text.translate(str.maketrans('', '', exclude))

df['Links'] = df['Links'].apply(remove_sp)

In [165]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,india can usher in the beginning of the end of...,[TRANSPORT],August23,0
1,volvo india to bid in govt ebus programme says...,[TRANSPORT],August23,0
2,optimizing warehouse space how asrs maximizes ...,[LOGISTICS],August23,0
3,bharat ncap to be launched by nitin gadkari ne...,[TRANSPORT],August23,0
4,andhra pradesh avera making waves on global es...,[EV],August23,0
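
Deleting punctuation outright glues hyphenated words together, as with `e-bus` becoming `ebus` in row 1 above. The sketch below shows an alternative that replaces each punctuation character with a space instead. `replace_punctuation` is a hypothetical helper, and whether `e bus` or `ebus` is preferable depends on the downstream model.

```python
import string

# One translation table, built once: every ASCII punctuation character becomes a space
PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in string.punctuation})

def replace_punctuation(text):
    """Replace punctuation with spaces, then collapse the extra whitespace."""
    return ' '.join(text.translate(PUNCT_TO_SPACE).split())

print(replace_punctuation('volvo india to bid in govt e-bus programme, says company chief'))
# → volvo india to bid in govt e bus programme says company chief
```

Building the translation table once at module level also avoids rebuilding it for every row, which `remove_sp` above does on each call.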


## Tokenization (word)

In [166]:
# Tokenizing into words
from nltk.tokenize import word_tokenize
df['Links'] = df['Links'].apply(word_tokenize)

In [167]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,"[india, can, usher, in, the, beginning, of, th...",[TRANSPORT],August23,0
1,"[volvo, india, to, bid, in, govt, ebus, progra...",[TRANSPORT],August23,0
2,"[optimizing, warehouse, space, how, asrs, maxi...",[LOGISTICS],August23,0
3,"[bharat, ncap, to, be, launched, by, nitin, ga...",[TRANSPORT],August23,0
4,"[andhra, pradesh, avera, making, waves, on, gl...",[EV],August23,0


## Normalization

### POS Tagging

In [168]:
import nltk
from nltk import pos_tag

In [169]:
# Downloading necessary NLTK data files
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [170]:
# POS Tagging
df['Links'] = df['Links'].apply(pos_tag)

In [171]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,"[(india, NN), (can, MD), (usher, VB), (in, IN)...",[TRANSPORT],August23,0
1,"[(volvo, NN), (india, NN), (to, TO), (bid, VB)...",[TRANSPORT],August23,0
2,"[(optimizing, VBG), (warehouse, NN), (space, N...",[LOGISTICS],August23,0
3,"[(bharat, NN), (ncap, NN), (to, TO), (be, VB),...",[TRANSPORT],August23,0
4,"[(andhra, NN), (pradesh, JJ), (avera, NN), (ma...",[EV],August23,0


### Lemmatization

In [172]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [173]:
def lemmatize_pos_tags(pos_tags):
    '''
    Lemmatize a list of (token, Penn Treebank POS tag) tuples.
    Each Penn Treebank tag is mapped to a WordNet POS tag by its first letter:
    'N' for nouns, 'V' for verbs, 'J' for adjectives, 'R' for adverbs.
    '''
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = []
    for token, tag in pos_tags:      # named `tag` to avoid shadowing nltk's pos_tag function
        wn_pos_tag = nltk.corpus.wordnet.NOUN if tag.startswith('N') else \
                     nltk.corpus.wordnet.VERB if tag.startswith('V') else \
                     nltk.corpus.wordnet.ADJ if tag.startswith('J') else \
                     nltk.corpus.wordnet.ADV if tag.startswith('R') else None

        if wn_pos_tag:
            # A WordNet POS tag was found, so lemmatize the token with it
            lemmatized_tokens.append(lemmatizer.lemmatize(token, pos=wn_pos_tag))
        else:
            # No WordNet POS tag applies, so keep the token as is
            lemmatized_tokens.append(token)
    return lemmatized_tokens

df['Links'] = df['Links'].apply(lemmatize_pos_tags)

In [174]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,"[india, can, usher, in, the, beginning, of, th...",[TRANSPORT],August23,0
1,"[volvo, india, to, bid, in, govt, ebus, progra...",[TRANSPORT],August23,0
2,"[optimize, warehouse, space, how, asrs, maximi...",[LOGISTICS],August23,0
3,"[bharat, ncap, to, be, launch, by, nitin, gadk...",[TRANSPORT],August23,0
4,"[andhra, pradesh, avera, make, wave, on, globa...",[EV],August23,0
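
The tag conversion inside `lemmatize_pos_tags` depends only on the first letter of the Penn Treebank tag, so it can also be written as a lookup table. The sketch below assumes the WordNet POS codes `'n'`, `'v'`, `'a'` and `'r'`, which are the values of `nltk.corpus.wordnet.NOUN`, `VERB`, `ADJ` and `ADV`. `penn_to_wordnet` is a hypothetical helper.

```python
# First letter of the Penn Treebank tag -> WordNet POS code
PENN_PREFIX_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if no mapping applies."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])

for tag in ['NN', 'VBG', 'JJ', 'RB', 'MD', 'TO']:
    print(tag, '->', penn_to_wordnet(tag))
```

Tags such as `MD` and `TO` map to `None`, which corresponds to the branch that keeps the token unchanged.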


## Removing digits

In [175]:
# Digits are removed only now, because removing them before POS tagging could have degraded the tags
import re

def remove_digits_from_tokens(tokens):
    
    # Regular expression pattern to match the digits
    digit_pattern = re.compile(r'\d+')
    
    # Remove digits from each token in the list
    tokens_without_digits = [re.sub(digit_pattern, '', token) for token in tokens]
    
    return tokens_without_digits

df['Links'] = df['Links'].apply(remove_digits_from_tokens)

In [176]:
# Contains empty tokens
df['Links'][11]

['',
 'high',
 'natural',
 'rubber',
 'production',
 'in',
 'fy',
 'all',
 'india',
 'rubber',
 'industry',
 'association']

In [177]:
# Function to remove empty tokens from a list of tokens
def remove_empty_tokens(tokens):
    return [token for token in tokens if token]

df['Links'] = df['Links'].apply(remove_empty_tokens)

In [178]:
# Successfully removed empty tokens
df['Links'][11]

['high',
 'natural',
 'rubber',
 'production',
 'in',
 'fy',
 'all',
 'india',
 'rubber',
 'industry',
 'association']
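
Digit removal and empty-token removal can also be done in a single pass, which avoids the intermediate lists that contain empty strings. The sketch below is one way to do it. `strip_digits` is a hypothetical helper, and the sample tokens are illustrative rather than taken from the data.

```python
import re

# Compiled once at module level instead of on every call
DIGITS = re.compile(r'\d+')

def strip_digits(tokens):
    """Remove digits from each token and drop tokens that become empty."""
    stripped = (DIGITS.sub('', token) for token in tokens)
    return [token for token in stripped if token]

print(strip_digits(['5', 'high', 'natural', 'rubber', 'production', 'in', 'fy24']))
# → ['high', 'natural', 'rubber', 'production', 'in', 'fy']
```

In the notebook this would be applied as `df['Links'].apply(strip_digits)` in place of the two separate steps.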

## Stopword removal

In [179]:
from nltk.corpus import stopwords

# Download the stopwords corpus
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [180]:
# Get the English stopwords list
stopwords_list = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stopwords_list]

df['Links'] = df['Links'].apply(remove_stopwords)

In [181]:
df.head()

Unnamed: 0,Links,Sector,Sheet,Shortlisted
0,"[india, usher, beginning, end, ice, age, globa...",[TRANSPORT],August23,0
1,"[volvo, india, bid, govt, ebus, programme, say...",[TRANSPORT],August23,0
2,"[optimize, warehouse, space, asrs, maximizes, ...",[LOGISTICS],August23,0
3,"[bharat, ncap, launch, nitin, gadkari, next, w...",[TRANSPORT],August23,0
4,"[andhra, pradesh, avera, make, wave, global, e...",[EV],August23,0
